Recurrent Networks - Part 2

LSTM was proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber. It is a recurrent neural network (RNN) architecture developed to deal with the exploding and vanishing gradient problems that can be encountered when training traditional RNNs.
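For reference, the standard LSTM cell update (the formulation we will implement from scratch in part 2.1) is:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \quad \text{(forget gate)}$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \quad \text{(input gate)}$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \quad \text{(output gate)}$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c), \qquad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)$$

The additive cell update $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ is the key design choice: gradients can flow through the cell state across many time steps, which is why LSTM suffers far less from vanishing gradients than a vanilla RNN.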

In this lesson, we will cover the following parts:

  1. Some applications of LSTM, e.g., Seq2Seq
    1.1 Naive implementation of a Seq2seq translation model
    1.2 Naive implementation of a Seq2seq translation model with an attention mechanism
  2. More about LSTM
    2.1 Exploring the inner structure of LSTM (implementing LSTM from scratch using PyTorch)
    2.2 Comparing how the gradients of LSTM and RNN change when the input is a very long sequence
    2.3 Observing the forget gate, input gate and output gate of LSTM

This tutorial mainly draws from seq2seq_translation_tutorial and Building an LSTM from Scratch in PyTorch.

Load necessary modules


In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import random
import math
import time

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
from typing import *
from torch.nn import Parameter
from torch.nn import init
from torch import Tensor
import numpy as np

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

Since data processing is not the part we should focus on in this tutorial, we put the data-related code in utility functions.
Of course, if we want to solve a problem seriously, there is no way to skip data processing, which may be tedious but is very important.


In [2]:
from utils import *

In [3]:
# Use GPU if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [4]:
def setup_seed(seed):
    """Set random seeds in order to reproduce the same results
    Args: 
        seed: random seed given by you
    """
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    # benchmark must be off for reproducibility; when it is on, cuDNN may
    # pick different (possibly non-deterministic) algorithms between runs
    torch.backends.cudnn.benchmark = False

1.1 Naive implementation of a Seq2seq translation model

A seq2seq translation model consists of two parts: an encoder and a decoder.
The encoder encodes the source sentence into a fixed-size vector for the decoder.
The decoder decodes that fixed-size vector into the target sentence.

1.1.1 The Encoder

The encoder of a seq2seq network is an LSTM that outputs some value for every word in the input sentence.
For every input word the encoder outputs a vector and a hidden state, and uses that hidden state for the next input word.


In [5]:
class EncoderLSTM(nn.Module):
    """Encoder using an LSTM backbone"""
    def __init__(self, input_size: int, hidden_size: int):
        """
        Args:
            input_size : size of the source vocabulary
            hidden_size: The number of features in the hidden state
        """
        super(EncoderLSTM, self).__init__()
        self.hidden_size = hidden_size
        # Learnable word embeddings: each word index from a vocabulary of size
        # input_size is mapped to a vector of dimensionality hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        # LSTM
        self.lstm = nn.LSTM(hidden_size, hidden_size)
        
    def forward(self, inputs: Tensor, state: Tuple[Tensor, Tensor]):
        """Forward
        Args:
            inputs: word index, shape [1]
            state : (hidden, cell), each of shape [1, 1, hidden_size]
        Returns:
            output: [1, 1, hidden_size]
            state: (hidden, cell)
        """
        (hidden, cell) = state
        # Retrieve word embeddings
        embedded = self.embedding(inputs).view(1, 1, -1)
        # Directly output embedding
        output = embedded
        output, (hidden, cell) = self.lstm(output, (hidden, cell))
        return output, (hidden, cell)
    
    def init_hidden(self):
        """Init hidden
        Returns:
            hidden:
            cell:
        """
        cell = torch.zeros(1, 1, self.hidden_size, device=device)
        hidden = torch.zeros(1, 1, self.hidden_size, device=device)
        return hidden, cell
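As a quick sanity check (a minimal sketch, not part of the original notebook; the vocabulary size 10 and hidden size 8 are arbitrary toy values), we can push a single word index through the encoder and inspect the shapes:

In [ ]:
toy_encoder = EncoderLSTM(input_size=10, hidden_size=8).to(device)
toy_state = toy_encoder.init_hidden()
toy_word = torch.tensor([3], device=device)   # a single word index
toy_out, (h, c) = toy_encoder(toy_word, toy_state)
print(toy_out.shape, h.shape, c.shape)        # all torch.Size([1, 1, 8])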

1.1.2 The Decoder

The decoder is another LSTM that takes the encoder output vector(s) and outputs a sequence of words to create the translation.

In the simplest seq2seq decoder we use only the last output of the encoder.

This last output is sometimes called the context vector as it encodes context from the entire sequence.

This context vector is used as the initial hidden state of the decoder.

At every step of decoding, the decoder is given an input token and hidden state.

The initial input token is the start-of-string token, and the first hidden state is the context vector (the encoder’s last hidden state).


In [6]:
class DecoderLSTM(nn.Module):
    """Decoder using an LSTM backbone"""
    def __init__(self, hidden_size: int, output_size: int):
        """
        Args:
            hidden_size: The number of features in the hidden state
            output_size: size of the target vocabulary
        """
        super(DecoderLSTM, self).__init__()
        self.hidden_size = hidden_size
        # Learnable word embeddings: each word index from a vocabulary of size
        # output_size is mapped to a vector of dimensionality hidden_size
        self.embedding = nn.Embedding(output_size, hidden_size)
        # LSTM
        self.lstm = nn.LSTM(hidden_size, hidden_size)
        # prediction layer over the target vocabulary
        self.out = nn.Linear(hidden_size, output_size)
        # log-softmax to produce log-probabilities (paired with NLLLoss)
        self.log_softmax = nn.LogSoftmax(dim=1)
        # activation function
        self.activation_function = F.relu
        
    def forward(self, inputs, state):
        """Forward
        Args:
            inputs: word index, shape [1, 1]
            state : (hidden, cell), each of shape [1, 1, hidden_size]
        Returns:
            output: [1, output_size], log-probabilities over the target vocabulary
            state: (hidden, cell)
        """
        (hidden, cell) = state
        # Retrieve word embeddings, [1, 1, hidden_size]
        output = self.embedding(inputs).view(1, 1, -1)
        # activation function, [1, 1, hidden_size]
        output = self.activation_function(output)
        # output: [1, 1, hidden_size]
        output, (hidden, cell) = self.lstm(output, (hidden, cell))
        # output: [1, output_size]
        output = self.log_softmax(self.out(output[0]))
        return output, (hidden, cell)

    def init_hidden(self):
        """Init hidden
        Returns:
            hidden:
            cell:
        """
        cell = torch.zeros(1, 1, self.hidden_size, device=device)
        hidden = torch.zeros(1, 1, self.hidden_size, device=device)
        return hidden, cell
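Again as a minimal sanity check (toy sizes only; SOS_token is the start-of-string index provided by utils), a single greedy decoding step looks like this:

In [ ]:
toy_decoder = DecoderLSTM(hidden_size=8, output_size=10).to(device)
dec_state = toy_decoder.init_hidden()
dec_input = torch.tensor([[SOS_token]], device=device)  # start-of-string token
log_probs, dec_state = toy_decoder(dec_input, dec_state)
topv, topi = log_probs.topk(1)                          # greedy: pick the most likely word
print(log_probs.shape, topi.item())                     # torch.Size([1, 10]) and a word index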

1.1.3 Train and Evaluate


In [7]:
def train_by_sentence(input_tensor, target_tensor, encoder, decoder, 
                      encoder_optimizer, decoder_optimizer, loss_fn, 
                      use_teacher_forcing=True, reverse_source_sentence=True,
                      max_length=MAX_LENGTH):
    """Train by single sentence using EncoderLSTM and DecoderLSTM
       including training and update model
    Args:
        input_tensor: [input_sequence_len, 1, hidden_size]
        target_tensor: [target_sequence_len, 1, hidden_size]
        encoder: EncoderLSTM
        decoder: DecoderLSTM
        encoder_optimizer: optimizer for encoder
        decoder_optimizer: optimizer for decoder
        loss_fn: loss function
        use_teacher_forcing: True is to Feed the target as the next input, 
                             False is to use its own predictions as the next input
        max_length: max length for input and output
    Returns:
        loss: scalar
    """
    if reverse_source_sentence:
        input_tensor = torch.flip(input_tensor, [0])
        
    hidden, cell = encoder.init_hidden()

    # Clear the gradients of all optimized tensors
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # Get sequence length of the input and target sentences.
    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)
    
    # encoder outputs:  [max_length, hidden_size]
    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    # Get encoder outputs
    for ei in range(input_length):
        encoder_output, (hidden, cell) = encoder(
            input_tensor[ei], (hidden, cell))
        encoder_outputs[ei] = encoder_output[0, 0]
    
    # First input for the decoder
    decoder_input = torch.tensor([[SOS_token]], device=device)
    
    # Last state of encoder as the init state of decoder
    decoder_hidden = (hidden, cell)

    for di in range(target_length):
        decoder_output, (hidden, cell) = decoder(
            decoder_input, (hidden, cell))
        
        if use_teacher_forcing:
            # Feed the target as the next input
            loss += loss_fn(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # Teacher forcing
        else:
            # Use its own predictions as the next input
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()
            loss += loss_fn(decoder_output, target_tensor[di])

        # Stop if the decoder outputs the end-of-sentence (EOS) token
        if decoder_input.item() == EOS_token:
            break

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length

In [8]:
def train(encoder, decoder, n_iters, reverse_source_sentence=True, 
          use_teacher_forcing=True,
          print_every=1000, plot_every=100, 
          learning_rate=0.01):
    """Train of Seq2seq
    Args:
        encoder: EncoderLSTM
        decoder: DecoderLSTM
        n_iters: train with n_iters sentences without replacement
        reverse_source_sentence: True is to reverse the source sentence 
                                 but keep order of target unchanged,
                                 False is to keep order of the source sentence 
                                 target unchanged
        use_teacher_forcing: True is to Feed the target as the next input, 
                             False is to use its own predictions as the next input
        print_every: print log every print_every 
        plot_every: plot every plot_every 
        learning_rate: 
        
    """
    
    start = time.time()
    
    plot_losses = []
    print_loss_total = 0
    plot_loss_total = 0

    # Use SGD to optimize encoder and decoder parameters
    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    
    # Obtain training input 
    training_pairs = [tensor_from_pair(random.choice(pairs), input_lang, output_lang)
                      for _ in range(n_iters)]
    
    # Negative log likelihood loss
    loss_fn = nn.NLLLoss()

    for i in range(1, n_iters+1):
        # Get a pair of sentences and move them to device,
        # training_pair: ([input_seq_len, 1], [target_seq_len, 1]) word indices
        training_pair = training_pairs[i-1]
        input_tensor = training_pair[0].to(device)
        target_tensor = training_pair[1].to(device)            
            
        # Train by a pair of source sentence and target sentence
        loss = train_by_sentence(input_tensor, target_tensor, 
                                 encoder, decoder,
                                 encoder_optimizer, decoder_optimizer, 
                                 loss_fn, use_teacher_forcing=use_teacher_forcing,
                                 reverse_source_sentence=reverse_source_sentence)
        
        print_loss_total += loss
        plot_loss_total += loss

        if i % print_every == 0:
            # Print Loss
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print("%s (%d %d%%) %.4f" % (time_since(start, i / n_iters),
                                         i, i / n_iters * 100, print_loss_avg))

        if i % plot_every == 0:
            # Plot
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0
    
    # show plot
    show_plot(plot_losses)

In [9]:
def evaluate_by_sentence(encoder, decoder, sentence, reverse_source_sentence, max_length=MAX_LENGTH):
    """Evalutae on a source sentence
    Args:
        encoder
        decoder
        sentence
        max_length
    Return:
        decoded_words: predicted sentence
    """
    with torch.no_grad():
        # Get tensor of sentence
        input_tensor = tensor_from_sentence(input_lang, sentence).to(device)
        input_length = input_tensor.size(0)
        
        if reverse_source_sentence:
            input_tensor = torch.flip(input_tensor, [0])
        
        # init state for encoder
        (hidden, cell) = encoder.init_hidden()

        # encoder outputs: [max_length, hidden_size]
        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            encoder_output, (hidden, cell) = encoder(input_tensor[ei],
                                                     (hidden, cell))
            encoder_outputs[ei] += encoder_output[0, 0]
            
        # Last state of encoder as the init state of decoder
        decoder_input = torch.tensor([[SOS_token]], device=device)
        decoder_hidden = (hidden, cell)
        decoded_words = []

        # At evaluation time, use the decoder's own predictions as the next input
        for di in range(max_length):
            decoder_output, (hidden, cell) = decoder(decoder_input, (hidden, cell))
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append("<EOS>")
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])
                
            decoder_input = topi.squeeze().detach()

    return decoded_words

In [10]:
def evaluate_randomly(encoder, decoder, n=10, reverse_source_sentence=True):
    """Random pick sentence from dataset and observe the effect of translation
    Args:
        encoder: 
        decoder:
        n: numbers of sentences to evaluate
    """
    for _ in range(n):
        pair = random.choice(pairs)
        # Source sentence
        print(">", pair[0])
        # Target sentence
        print("=", pair[1])
        output_words = evaluate_by_sentence(encoder, decoder, pair[0], reverse_source_sentence)
        output_sentence = " ".join(output_words)
        # Predicted sentence
        print("<", output_sentence)
        print("")

In [11]:
def show_plot(points):
    """Plot the loss curve given a list of loss values"""
    fig, ax = plt.subplots()
    # put y-axis ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    plt.plot(points)
    plt.show()

1.1.4 Let's load data and train

Use the prepare_data function to obtain sentence pairs (source sentence, target sentence).


In [19]:
# prepare_data is defined in utils.py.
# reverse=True here means the source sentences are French
# and the target sentences are English.
input_lang, output_lang, pairs = prepare_data('eng', 'fra', reverse=True)
print(random.choice(pairs))


Reading lines...
Read 135842 sentence pairs
Reverse source sentence
Trimmed to 10599 sentence pairs
Counting words ...
Counting words:
fra 4345
eng 2803
['elle n est pas sans argent .', 'she s not penniless .']

In [14]:
setup_seed(45)
hidden_size = 256
# Reverse the order of source input sentence
reverse_source_sentence = True
# Feed the target as the next input
use_teacher_forcing = True
encoder = EncoderLSTM(input_lang.n_words, hidden_size).to(device)
decoder = DecoderLSTM(hidden_size, output_lang.n_words).to(device)
print(">> Model is on: {}".format(next(encoder.parameters()).is_cuda))
print(">> Model is on: {}".format(next(decoder.parameters()).is_cuda))


>> Model is on: True
>> Model is on: True

In [15]:
iters = 50000
train(encoder, decoder, iters, reverse_source_sentence=reverse_source_sentence, 
      use_teacher_forcing=use_teacher_forcing, print_every=250, plot_every=250)


0m 35s (- 117m 13s) (250 0%) 4.7301
0m 50s (- 83m 42s) (500 1%) 3.3269
1m 4s (- 70m 23s) (750 1%) 3.0458
1m 17s (- 63m 34s) (1000 2%) 2.8595
1m 31s (- 59m 25s) (1250 2%) 2.8384
1m 44s (- 56m 23s) (1500 3%) 2.7370
1m 57s (- 53m 52s) (1750 3%) 2.6759
2m 10s (- 52m 1s) (2000 4%) 2.6775
2m 22s (- 50m 27s) (2250 4%) 2.6329
2m 35s (- 49m 16s) (2500 5%) 2.6113
2m 49s (- 48m 24s) (2750 5%) 2.5890
3m 2s (- 47m 33s) (3000 6%) 2.5076
3m 14s (- 46m 42s) (3250 6%) 2.4959
3m 27s (- 45m 57s) (3500 7%) 2.5263
3m 40s (- 45m 15s) (3750 7%) 2.5167
3m 52s (- 44m 37s) (4000 8%) 2.3970
4m 5s (- 44m 2s) (4250 8%) 2.4132
4m 18s (- 43m 29s) (4500 9%) 2.3209
4m 30s (- 42m 57s) (4750 9%) 2.2764
4m 43s (- 42m 28s) (5000 10%) 2.2885
4m 55s (- 42m 0s) (5250 10%) 2.3332
5m 8s (- 41m 36s) (5500 11%) 2.3204
5m 21s (- 41m 11s) (5750 11%) 2.2541
5m 33s (- 40m 49s) (6000 12%) 2.2727
5m 46s (- 40m 27s) (6250 12%) 2.2329
5m 59s (- 40m 4s) (6500 13%) 2.1703
6m 11s (- 39m 41s) (6750 13%) 2.0671
6m 24s (- 39m 19s) (7000 14%) 2.1644
6m 36s (- 38m 59s) (7250 14%) 2.2096
6m 49s (- 38m 39s) (7500 15%) 2.0804
7m 2s (- 38m 20s) (7750 15%) 2.1003
7m 14s (- 38m 1s) (8000 16%) 2.0653
7m 27s (- 37m 43s) (8250 16%) 2.0542
7m 40s (- 37m 26s) (8500 17%) 2.0976
7m 53s (- 37m 10s) (8750 17%) 2.0354
8m 5s (- 36m 52s) (9000 18%) 2.0289
8m 18s (- 36m 36s) (9250 18%) 1.9037
8m 31s (- 36m 18s) (9500 19%) 1.9525
8m 43s (- 36m 0s) (9750 19%) 1.8598
8m 56s (- 35m 45s) (10000 20%) 1.9433
9m 8s (- 35m 28s) (10250 20%) 1.9164
9m 21s (- 35m 11s) (10500 21%) 1.8676
9m 33s (- 34m 54s) (10750 21%) 1.8912
9m 46s (- 34m 39s) (11000 22%) 1.8940
9m 59s (- 34m 23s) (11250 22%) 1.8391
10m 12s (- 34m 8s) (11500 23%) 1.9038
10m 24s (- 33m 54s) (11750 23%) 1.8223
10m 37s (- 33m 38s) (12000 24%) 1.7111
10m 50s (- 33m 23s) (12250 24%) 1.8238
11m 2s (- 33m 8s) (12500 25%) 1.7750
11m 15s (- 32m 53s) (12750 25%) 1.8930
11m 28s (- 32m 38s) (13000 26%) 1.7776
11m 41s (- 32m 24s) (13250 26%) 1.7633
11m 53s (- 32m 9s) (13500 27%) 1.7333
12m 6s (- 31m 54s) (13750 27%) 1.7893
12m 19s (- 31m 40s) (14000 28%) 1.7390
12m 32s (- 31m 26s) (14250 28%) 1.7701
12m 44s (- 31m 12s) (14500 28%) 1.7329
12m 57s (- 30m 58s) (14750 29%) 1.6696
13m 10s (- 30m 44s) (15000 30%) 1.6313
13m 23s (- 30m 29s) (15250 30%) 1.7256
13m 35s (- 30m 15s) (15500 31%) 1.6859
13m 48s (- 30m 1s) (15750 31%) 1.6195
14m 0s (- 29m 46s) (16000 32%) 1.5513
14m 13s (- 29m 32s) (16250 32%) 1.6846
14m 26s (- 29m 18s) (16500 33%) 1.6875
14m 38s (- 29m 4s) (16750 33%) 1.5778
14m 51s (- 28m 50s) (17000 34%) 1.6210
15m 4s (- 28m 36s) (17250 34%) 1.5758
15m 16s (- 28m 22s) (17500 35%) 1.5593
15m 29s (- 28m 9s) (17750 35%) 1.5810
15m 42s (- 27m 55s) (18000 36%) 1.5944
15m 55s (- 27m 41s) (18250 36%) 1.5053
16m 7s (- 27m 28s) (18500 37%) 1.4108
16m 20s (- 27m 14s) (18750 37%) 1.5082
16m 33s (- 27m 0s) (19000 38%) 1.5458
16m 45s (- 26m 46s) (19250 38%) 1.4254
16m 57s (- 26m 32s) (19500 39%) 1.4709
17m 10s (- 26m 17s) (19750 39%) 1.4742
17m 22s (- 26m 3s) (20000 40%) 1.3979
17m 35s (- 25m 50s) (20250 40%) 1.4668
17m 47s (- 25m 36s) (20500 41%) 1.4649
18m 0s (- 25m 22s) (20750 41%) 1.4709
18m 12s (- 25m 9s) (21000 42%) 1.4918
18m 25s (- 24m 56s) (21250 42%) 1.4107
18m 38s (- 24m 42s) (21500 43%) 1.4762
18m 51s (- 24m 29s) (21750 43%) 1.5225
19m 3s (- 24m 15s) (22000 44%) 1.4054
19m 16s (- 24m 2s) (22250 44%) 1.3352
19m 29s (- 23m 49s) (22500 45%) 1.3740
19m 41s (- 23m 35s) (22750 45%) 1.4333
19m 54s (- 23m 22s) (23000 46%) 1.3943
20m 7s (- 23m 9s) (23250 46%) 1.2736
20m 20s (- 22m 55s) (23500 47%) 1.3318
20m 32s (- 22m 42s) (23750 47%) 1.3693
20m 45s (- 22m 28s) (24000 48%) 1.3522
20m 57s (- 22m 15s) (24250 48%) 1.2736
21m 10s (- 22m 2s) (24500 49%) 1.3980
21m 22s (- 21m 48s) (24750 49%) 1.2201
21m 35s (- 21m 35s) (25000 50%) 1.2675
21m 48s (- 21m 22s) (25250 50%) 1.3469
22m 1s (- 21m 9s) (25500 51%) 1.2714
22m 13s (- 20m 56s) (25750 51%) 1.2665
22m 26s (- 20m 42s) (26000 52%) 1.2653
22m 38s (- 20m 29s) (26250 52%) 1.1929
22m 51s (- 20m 16s) (26500 53%) 1.2523
23m 4s (- 20m 3s) (26750 53%) 1.2691
23m 16s (- 19m 49s) (27000 54%) 1.1528
23m 29s (- 19m 36s) (27250 54%) 1.2370
23m 42s (- 19m 23s) (27500 55%) 1.2660
23m 55s (- 19m 10s) (27750 55%) 1.2506
24m 7s (- 18m 57s) (28000 56%) 1.2735
24m 20s (- 18m 44s) (28250 56%) 1.2148
24m 33s (- 18m 31s) (28500 56%) 1.2847
24m 45s (- 18m 18s) (28750 57%) 1.2118
24m 58s (- 18m 5s) (29000 57%) 1.1789
25m 11s (- 17m 52s) (29250 58%) 1.1460
25m 24s (- 17m 39s) (29500 59%) 1.1338
25m 37s (- 17m 26s) (29750 59%) 1.1070
25m 49s (- 17m 13s) (30000 60%) 1.2129
26m 1s (- 16m 59s) (30250 60%) 1.0972
26m 14s (- 16m 46s) (30500 61%) 1.0851
26m 27s (- 16m 34s) (30750 61%) 1.1832
26m 40s (- 16m 21s) (31000 62%) 1.0532
26m 53s (- 16m 8s) (31250 62%) 1.1463
27m 6s (- 15m 55s) (31500 63%) 1.0433
27m 19s (- 15m 42s) (31750 63%) 1.0821
27m 32s (- 15m 29s) (32000 64%) 1.0334
27m 45s (- 15m 16s) (32250 64%) 1.1181
28m 1s (- 15m 5s) (32500 65%) 1.1509
28m 16s (- 14m 53s) (32750 65%) 1.1036
28m 30s (- 14m 40s) (33000 66%) 1.0277
28m 43s (- 14m 28s) (33250 66%) 1.1785
28m 57s (- 14m 15s) (33500 67%) 1.0550
29m 10s (- 14m 2s) (33750 67%) 1.0629
29m 24s (- 13m 50s) (34000 68%) 1.0696
29m 38s (- 13m 37s) (34250 68%) 1.0918
29m 51s (- 13m 24s) (34500 69%) 1.0613
30m 4s (- 13m 11s) (34750 69%) 1.0352
30m 18s (- 12m 59s) (35000 70%) 1.0065
30m 31s (- 12m 46s) (35250 70%) 1.0674
30m 46s (- 12m 34s) (35500 71%) 1.0631
30m 59s (- 12m 21s) (35750 71%) 1.1001
31m 14s (- 12m 8s) (36000 72%) 1.0393
31m 27s (- 11m 55s) (36250 72%) 0.9400
31m 40s (- 11m 42s) (36500 73%) 1.0264
31m 52s (- 11m 29s) (36750 73%) 0.9909
32m 5s (- 11m 16s) (37000 74%) 0.9877
32m 18s (- 11m 3s) (37250 74%) 0.9790
32m 30s (- 10m 50s) (37500 75%) 0.8614
32m 43s (- 10m 37s) (37750 75%) 0.8985
32m 56s (- 10m 24s) (38000 76%) 0.9313
33m 9s (- 10m 11s) (38250 76%) 0.9810
33m 21s (- 9m 57s) (38500 77%) 0.8965
33m 34s (- 9m 44s) (38750 77%) 0.9325
33m 47s (- 9m 31s) (39000 78%) 0.9488
34m 0s (- 9m 18s) (39250 78%) 0.8820
34m 13s (- 9m 5s) (39500 79%) 0.9141
34m 26s (- 8m 52s) (39750 79%) 0.9451
34m 39s (- 8m 39s) (40000 80%) 0.8610
34m 51s (- 8m 26s) (40250 80%) 0.8987
35m 4s (- 8m 13s) (40500 81%) 0.9370
35m 17s (- 8m 0s) (40750 81%) 0.9663
35m 30s (- 7m 47s) (41000 82%) 0.8364
35m 42s (- 7m 34s) (41250 82%) 0.9296
35m 57s (- 7m 21s) (41500 83%) 0.8876
36m 10s (- 7m 8s) (41750 83%) 0.7837
36m 24s (- 6m 56s) (42000 84%) 0.8643
36m 37s (- 6m 43s) (42250 84%) 0.9092
36m 50s (- 6m 30s) (42500 85%) 0.8111
37m 3s (- 6m 17s) (42750 85%) 0.8668
37m 17s (- 6m 4s) (43000 86%) 0.8687
37m 30s (- 5m 51s) (43250 86%) 0.8701
37m 42s (- 5m 38s) (43500 87%) 0.8108
37m 55s (- 5m 25s) (43750 87%) 0.7329
38m 8s (- 5m 12s) (44000 88%) 0.8410
38m 21s (- 4m 59s) (44250 88%) 0.8041
38m 34s (- 4m 46s) (44500 89%) 0.7772
38m 48s (- 4m 33s) (44750 89%) 0.8702
39m 1s (- 4m 20s) (45000 90%) 0.8274
39m 15s (- 4m 7s) (45250 90%) 0.7602
39m 28s (- 3m 54s) (45500 91%) 0.8276
39m 44s (- 3m 41s) (45750 91%) 0.7752
40m 4s (- 3m 29s) (46000 92%) 0.7822
40m 18s (- 3m 16s) (46250 92%) 0.7470
40m 33s (- 3m 3s) (46500 93%) 0.7725
40m 47s (- 2m 50s) (46750 93%) 0.7477
41m 0s (- 2m 37s) (47000 94%) 0.7231
41m 12s (- 2m 23s) (47250 94%) 0.7538
41m 25s (- 2m 10s) (47500 95%) 0.8537
41m 39s (- 1m 57s) (47750 95%) 0.7798
41m 52s (- 1m 44s) (48000 96%) 0.7322
42m 4s (- 1m 31s) (48250 96%) 0.8085
42m 24s (- 1m 18s) (48500 97%) 0.7098
42m 43s (- 1m 5s) (48750 97%) 0.7215
42m 59s (- 0m 52s) (49000 98%) 0.8122
43m 16s (- 0m 39s) (49250 98%) 0.7791
43m 32s (- 0m 26s) (49500 99%) 0.7251
43m 47s (- 0m 13s) (49750 99%) 0.7874
44m 1s (- 0m 0s) (50000 100%) 0.7124
[plot: training loss curve]

In [16]:
# Randomly pick 10 sentences and observe the performance
evaluate_randomly(encoder, decoder, 10, reverse_source_sentence)


> je suis tres fier de nos etudiants .
= i m very proud of our students .
< i m very proud of you . <EOS>

> vous etes faibles .
= you re weak .
< you re rude . <EOS>

> tu n es pas si vieux .
= you re not that old .
< you re not that old . <EOS>

> je songe a demissionner immediatement .
= i am thinking of resigning at once .
< i m thinking about the problem . <EOS>

> je suis en retard sur le programme .
= i m behind schedule .
< i m behind schedule . <EOS>

> je suis submerge de travail .
= i m swamped with work .
< i m proud of that . <EOS>

> je ne vais pas prendre le moindre risque .
= i m not taking any chances .
< i m not taking any chances . <EOS>

> je suis au restaurant .
= i m at the restaurant .
< i m in the office . <EOS>

> c est toi la doyenne .
= you re the oldest .
< you re the oldest . <EOS>

> je suis tres reconnaissant pour votre aide .
= i m very grateful for your help .
< i m very worried about you . <EOS>

Homework 1

  1. Note that in the Seq2seq paper, the input sentence is fed in reverse order, and this lab does the same. According to the paper, if the input sentence is fed in its original order, the model should converge more slowly under otherwise identical conditions. Please run train to test this idea. (Hint: reverse_source_sentence controls whether the source sentence is fed in reverse order.)

Answer: Judging from the results of the runs, the convergence speed is roughly the same whether the source sentence is fed in order or in reverse; there is no obvious gap.

  2. Note that in this notebook the decoder's input can come either from the target sentence or from the decoder's own output at the previous time step. Please run train to see what difference it makes. (Hint: use_teacher_forcing controls whether the target is fed as the next input.)

Answer: Without use_teacher_forcing, the loss starts out lower than that of the model above, but training is less effective and the final model ends up slightly worse.

  3. Besides relu, tanh can also be used as the decoder's activation function. Please change the decoder's activation function and run train.

Answer: With tanh, convergence at the beginning is faster than with relu, but the final loss values end up about the same.


In [17]:
# Hw 1.1

setup_seed(45)
hidden_size = 256
# Reverse the order of source input sentence
reverse_source_sentence = False
# Feed the target as the next input
use_teacher_forcing = True
encoder = EncoderLSTM(input_lang.n_words, hidden_size).to(device)
decoder = DecoderLSTM(hidden_size, output_lang.n_words).to(device)
print(">> Model is on: {}".format(next(encoder.parameters()).is_cuda))
print(">> Model is on: {}".format(next(decoder.parameters()).is_cuda))

iters = 50000
train(encoder, decoder, iters, reverse_source_sentence=reverse_source_sentence, 
      use_teacher_forcing=use_teacher_forcing, print_every=250, plot_every=250)


>> Model is on: True
>> Model is on: True
0m 15s (- 49m 58s) (250 0%) 4.6714
0m 28s (- 46m 57s) (500 1%) 3.4858
0m 46s (- 50m 48s) (750 1%) 3.2751
1m 2s (- 51m 9s) (1000 2%) 3.0976
1m 17s (- 50m 26s) (1250 2%) 3.0855
1m 30s (- 48m 57s) (1500 3%) 2.9952
1m 48s (- 49m 56s) (1750 3%) 2.9293
2m 3s (- 49m 23s) (2000 4%) 2.9101
2m 16s (- 48m 26s) (2250 4%) 2.8425
2m 30s (- 47m 34s) (2500 5%) 2.8138
2m 44s (- 47m 5s) (2750 5%) 2.7875
2m 57s (- 46m 25s) (3000 6%) 2.6527
3m 10s (- 45m 44s) (3250 6%) 2.6663
3m 24s (- 45m 10s) (3500 7%) 2.6896
3m 36s (- 44m 34s) (3750 7%) 2.6877
3m 49s (- 44m 4s) (4000 8%) 2.5639
4m 3s (- 43m 41s) (4250 8%) 2.5633
4m 17s (- 43m 19s) (4500 9%) 2.4649
4m 30s (- 42m 56s) (4750 9%) 2.4340
4m 44s (- 42m 40s) (5000 10%) 2.4237
5m 0s (- 42m 40s) (5250 10%) 2.4428
5m 14s (- 42m 24s) (5500 11%) 2.4482
5m 27s (- 42m 3s) (5750 11%) 2.3957
5m 41s (- 41m 41s) (6000 12%) 2.3834
5m 54s (- 41m 23s) (6250 12%) 2.3281
6m 9s (- 41m 12s) (6500 13%) 2.2984
6m 24s (- 41m 5s) (6750 13%) 2.2002
6m 37s (- 40m 44s) (7000 14%) 2.2964
6m 52s (- 40m 30s) (7250 14%) 2.3317
7m 9s (- 40m 34s) (7500 15%) 2.2087
7m 24s (- 40m 24s) (7750 15%) 2.2318
7m 39s (- 40m 10s) (8000 16%) 2.1741
7m 52s (- 39m 52s) (8250 16%) 2.1802
8m 7s (- 39m 37s) (8500 17%) 2.2237
8m 21s (- 39m 26s) (8750 17%) 2.1596
8m 36s (- 39m 12s) (9000 18%) 2.1561
8m 49s (- 38m 54s) (9250 18%) 2.0127
9m 4s (- 38m 39s) (9500 19%) 2.0805
9m 17s (- 38m 20s) (9750 19%) 1.9760
9m 31s (- 38m 4s) (10000 20%) 2.0731
9m 44s (- 37m 48s) (10250 20%) 2.0243
9m 59s (- 37m 34s) (10500 21%) 1.9722
10m 13s (- 37m 19s) (10750 21%) 2.0133
10m 27s (- 37m 5s) (11000 22%) 2.0114
10m 40s (- 36m 46s) (11250 22%) 1.9371
10m 54s (- 36m 29s) (11500 23%) 2.0084
11m 8s (- 36m 14s) (11750 23%) 1.9365
11m 21s (- 35m 58s) (12000 24%) 1.8248
11m 34s (- 35m 41s) (12250 24%) 1.9471
11m 48s (- 35m 25s) (12500 25%) 1.8760
12m 2s (- 35m 11s) (12750 25%) 2.0061
12m 16s (- 34m 55s) (13000 26%) 1.8975
12m 29s (- 34m 39s) (13250 26%) 1.8681
12m 43s (- 34m 23s) (13500 27%) 1.8666
12m 56s (- 34m 7s) (13750 27%) 1.8889
13m 9s (- 33m 51s) (14000 28%) 1.8494
13m 23s (- 33m 35s) (14250 28%) 1.8718
13m 36s (- 33m 19s) (14500 28%) 1.8546
13m 49s (- 33m 2s) (14750 29%) 1.7775
14m 4s (- 32m 49s) (15000 30%) 1.7341
14m 17s (- 32m 34s) (15250 30%) 1.8543
14m 31s (- 32m 20s) (15500 31%) 1.8051
14m 47s (- 32m 10s) (15750 31%) 1.7052
15m 2s (- 31m 58s) (16000 32%) 1.6546
15m 16s (- 31m 44s) (16250 32%) 1.7894
15m 30s (- 31m 29s) (16500 33%) 1.7909
15m 43s (- 31m 13s) (16750 33%) 1.6812
15m 57s (- 30m 59s) (17000 34%) 1.7208
16m 12s (- 30m 46s) (17250 34%) 1.6634
16m 26s (- 30m 32s) (17500 35%) 1.6423
16m 40s (- 30m 18s) (17750 35%) 1.6812
16m 54s (- 30m 4s) (18000 36%) 1.6888
17m 8s (- 29m 49s) (18250 36%) 1.6100
17m 23s (- 29m 35s) (18500 37%) 1.5059
17m 37s (- 29m 22s) (18750 37%) 1.5959
17m 51s (- 29m 7s) (19000 38%) 1.6546
18m 5s (- 28m 53s) (19250 38%) 1.5423
18m 20s (- 28m 41s) (19500 39%) 1.5616
18m 35s (- 28m 27s) (19750 39%) 1.5739
18m 49s (- 28m 14s) (20000 40%) 1.5041
19m 3s (- 28m 0s) (20250 40%) 1.5423
19m 16s (- 27m 44s) (20500 41%) 1.5468
19m 31s (- 27m 30s) (20750 41%) 1.5539
19m 44s (- 27m 15s) (21000 42%) 1.5784
19m 57s (- 27m 0s) (21250 42%) 1.5044
20m 10s (- 26m 45s) (21500 43%) 1.5461
20m 23s (- 26m 29s) (21750 43%) 1.6279
20m 35s (- 26m 12s) (22000 44%) 1.4824
20m 48s (- 25m 57s) (22250 44%) 1.4165
21m 1s (- 25m 42s) (22500 45%) 1.4686
21m 14s (- 25m 26s) (22750 45%) 1.4966
21m 27s (- 25m 11s) (23000 46%) 1.4789
21m 41s (- 24m 57s) (23250 46%) 1.3615
21m 55s (- 24m 43s) (23500 47%) 1.4052
22m 10s (- 24m 30s) (23750 47%) 1.4511
22m 24s (- 24m 16s) (24000 48%) 1.4402
22m 37s (- 24m 1s) (24250 48%) 1.3447
22m 50s (- 23m 46s) (24500 49%) 1.4752
23m 3s (- 23m 31s) (24750 49%) 1.3075
23m 16s (- 23m 16s) (25000 50%) 1.3571
23m 29s (- 23m 1s) (25250 50%) 1.4438
23m 42s (- 22m 46s) (25500 51%) 1.3615
23m 55s (- 22m 31s) (25750 51%) 1.3481
24m 8s (- 22m 17s) (26000 52%) 1.3252
24m 21s (- 22m 2s) (26250 52%) 1.2551
24m 34s (- 21m 47s) (26500 53%) 1.3232
24m 46s (- 21m 32s) (26750 53%) 1.3265
24m 59s (- 21m 17s) (27000 54%) 1.2293
25m 12s (- 21m 2s) (27250 54%) 1.3035
25m 25s (- 20m 48s) (27500 55%) 1.3292
25m 39s (- 20m 34s) (27750 55%) 1.3038
25m 52s (- 20m 20s) (28000 56%) 1.3342
26m 6s (- 20m 6s) (28250 56%) 1.2919
26m 20s (- 19m 52s) (28500 56%) 1.3521
26m 33s (- 19m 38s) (28750 57%) 1.2852
26m 49s (- 19m 25s) (29000 57%) 1.2585
27m 4s (- 19m 12s) (29250 58%) 1.2053
27m 19s (- 18m 59s) (29500 59%) 1.2100
27m 35s (- 18m 47s) (29750 59%) 1.1837
27m 51s (- 18m 34s) (30000 60%) 1.2855
28m 4s (- 18m 19s) (30250 60%) 1.1795
28m 17s (- 18m 5s) (30500 61%) 1.1474
28m 32s (- 17m 52s) (30750 61%) 1.2269
28m 46s (- 17m 38s) (31000 62%) 1.1390
28m 59s (- 17m 23s) (31250 62%) 1.1778
29m 12s (- 17m 9s) (31500 63%) 1.1195
29m 28s (- 16m 56s) (31750 63%) 1.1402
29m 41s (- 16m 42s) (32000 64%) 1.0976
29m 55s (- 16m 28s) (32250 64%) 1.1781
30m 11s (- 16m 15s) (32500 65%) 1.1969
30m 25s (- 16m 1s) (32750 65%) 1.1784
30m 40s (- 15m 48s) (33000 66%) 1.1126
30m 54s (- 15m 34s) (33250 66%) 1.2442
31m 8s (- 15m 20s) (33500 67%) 1.1424
31m 22s (- 15m 6s) (33750 67%) 1.1265
31m 38s (- 14m 53s) (34000 68%) 1.1299
31m 51s (- 14m 38s) (34250 68%) 1.1325
32m 4s (- 14m 24s) (34500 69%) 1.0950
32m 17s (- 14m 10s) (34750 69%) 1.0889
32m 30s (- 13m 55s) (35000 70%) 1.0644
32m 43s (- 13m 41s) (35250 70%) 1.1237
32m 56s (- 13m 27s) (35500 71%) 1.1032
33m 9s (- 13m 12s) (35750 71%) 1.1367
33m 22s (- 12m 58s) (36000 72%) 1.1039
33m 36s (- 12m 44s) (36250 72%) 1.0013
33m 49s (- 12m 30s) (36500 73%) 1.1018
34m 2s (- 12m 16s) (36750 73%) 1.0291
34m 17s (- 12m 2s) (37000 74%) 1.0513
34m 30s (- 11m 48s) (37250 74%) 1.0345
34m 44s (- 11m 34s) (37500 75%) 0.9229
34m 58s (- 11m 21s) (37750 75%) 0.9472
35m 12s (- 11m 7s) (38000 76%) 0.9795
35m 27s (- 10m 53s) (38250 76%) 1.0625
35m 42s (- 10m 39s) (38500 77%) 0.9455
35m 56s (- 10m 26s) (38750 77%) 0.9808
36m 10s (- 10m 12s) (39000 78%) 1.0044
36m 22s (- 9m 57s) (39250 78%) 0.9169
36m 36s (- 9m 43s) (39500 79%) 0.9849
36m 50s (- 9m 29s) (39750 79%) 0.9961
37m 3s (- 9m 15s) (40000 80%) 0.9106
37m 17s (- 9m 2s) (40250 80%) 0.9502
37m 30s (- 8m 47s) (40500 81%) 0.9925
37m 44s (- 8m 34s) (40750 81%) 0.9969
37m 58s (- 8m 20s) (41000 82%) 0.9025
38m 12s (- 8m 6s) (41250 82%) 0.9700
38m 26s (- 7m 52s) (41500 83%) 0.9398
38m 40s (- 7m 38s) (41750 83%) 0.8147
38m 54s (- 7m 24s) (42000 84%) 0.9346
39m 8s (- 7m 10s) (42250 84%) 0.9792
39m 22s (- 6m 56s) (42500 85%) 0.8348
39m 37s (- 6m 43s) (42750 85%) 0.9208
39m 52s (- 6m 29s) (43000 86%) 0.9285
40m 8s (- 6m 15s) (43250 86%) 0.9242
40m 21s (- 6m 1s) (43500 87%) 0.8695
40m 34s (- 5m 47s) (43750 87%) 0.7792
40m 49s (- 5m 34s) (44000 88%) 0.8718
41m 4s (- 5m 20s) (44250 88%) 0.8257
41m 18s (- 5m 6s) (44500 89%) 0.8202
41m 31s (- 4m 52s) (44750 89%) 0.9108
41m 46s (- 4m 38s) (45000 90%) 0.8588
42m 2s (- 4m 24s) (45250 90%) 0.7951
42m 16s (- 4m 10s) (45500 91%) 0.8723
42m 30s (- 3m 56s) (45750 91%) 0.8268
42m 45s (- 3m 43s) (46000 92%) 0.8319
42m 59s (- 3m 29s) (46250 92%) 0.7746
43m 13s (- 3m 15s) (46500 93%) 0.7857
43m 28s (- 3m 1s) (46750 93%) 0.7949
43m 43s (- 2m 47s) (47000 94%) 0.7646
43m 58s (- 2m 33s) (47250 94%) 0.7716
44m 13s (- 2m 19s) (47500 95%) 0.9043
44m 27s (- 2m 5s) (47750 95%) 0.8049
44m 42s (- 1m 51s) (48000 96%) 0.7793
44m 59s (- 1m 37s) (48250 96%) 0.8312
45m 14s (- 1m 23s) (48500 97%) 0.7354
45m 29s (- 1m 9s) (48750 97%) 0.7525
45m 43s (- 0m 55s) (49000 98%) 0.8443
45m 57s (- 0m 41s) (49250 98%) 0.8365
46m 10s (- 0m 27s) (49500 99%) 0.7765
46m 22s (- 0m 13s) (49750 99%) 0.8225
46m 35s (- 0m 0s) (50000 100%) 0.7427
[plot: training loss curve]

In [18]:
# Hw 1.2

setup_seed(45)
hidden_size = 256
# Reverse the order of source input sentence
reverse_source_sentence = True
# Feed the target as the next input
use_teacher_forcing = False
encoder = EncoderLSTM(input_lang.n_words, hidden_size).to(device)
decoder = DecoderLSTM(hidden_size, output_lang.n_words).to(device)
print(">> Model is on: {}".format(next(encoder.parameters()).is_cuda))
print(">> Model is on: {}".format(next(decoder.parameters()).is_cuda))

iters = 50000
train(encoder, decoder, iters, reverse_source_sentence=reverse_source_sentence, 
      use_teacher_forcing=use_teacher_forcing, print_every=250, plot_every=250)


>> Model is on: True
>> Model is on: True
0m 14s (- 49m 22s) (250 0%) 4.5025
0m 27s (- 45m 11s) (500 1%) 3.3029
0m 39s (- 43m 14s) (750 1%) 2.9893
0m 52s (- 42m 33s) (1000 2%) 2.8886
1m 6s (- 43m 10s) (1250 2%) 2.9643
1m 21s (- 43m 53s) (1500 3%) 2.8594
1m 35s (- 43m 40s) (1750 3%) 2.8643
1m 49s (- 43m 48s) (2000 4%) 2.7976
2m 2s (- 43m 24s) (2250 4%) 2.8458
2m 15s (- 42m 54s) (2500 5%) 2.8655
2m 29s (- 42m 54s) (2750 5%) 2.8859
2m 43s (- 42m 40s) (3000 6%) 2.7476
2m 57s (- 42m 38s) (3250 6%) 2.7577
3m 11s (- 42m 27s) (3500 7%) 2.7986
3m 26s (- 42m 22s) (3750 7%) 2.7935
3m 41s (- 42m 22s) (4000 8%) 2.7148
3m 56s (- 42m 20s) (4250 8%) 2.6984
4m 10s (- 42m 14s) (4500 9%) 2.6851
4m 26s (- 42m 14s) (4750 9%) 2.6087
4m 40s (- 42m 0s) (5000 10%) 2.5876
4m 53s (- 41m 42s) (5250 10%) 2.7120
5m 7s (- 41m 29s) (5500 11%) 2.6891
5m 21s (- 41m 13s) (5750 11%) 2.6112
5m 36s (- 41m 8s) (6000 12%) 2.6307
5m 52s (- 41m 5s) (6250 12%) 2.5999
6m 7s (- 40m 58s) (6500 13%) 2.5548
6m 20s (- 40m 39s) (6750 13%) 2.4575
6m 34s (- 40m 23s) (7000 14%) 2.5242
6m 49s (- 40m 11s) (7250 14%) 2.5643
7m 3s (- 39m 58s) (7500 15%) 2.4702
7m 18s (- 39m 51s) (7750 15%) 2.4989
7m 33s (- 39m 43s) (8000 16%) 2.4834
7m 50s (- 39m 40s) (8250 16%) 2.4431
8m 4s (- 39m 27s) (8500 17%) 2.5258
8m 20s (- 39m 19s) (8750 17%) 2.4261
8m 34s (- 39m 3s) (9000 18%) 2.4441
8m 49s (- 38m 51s) (9250 18%) 2.2770
9m 2s (- 38m 33s) (9500 19%) 2.3535
9m 15s (- 38m 15s) (9750 19%) 2.2491
9m 29s (- 37m 58s) (10000 20%) 2.3741
9m 43s (- 37m 41s) (10250 20%) 2.2997
9m 56s (- 37m 23s) (10500 21%) 2.2575
10m 10s (- 37m 9s) (10750 21%) 2.2566
10m 25s (- 36m 58s) (11000 22%) 2.2933
10m 40s (- 36m 44s) (11250 22%) 2.2505
10m 55s (- 36m 34s) (11500 23%) 2.3371
11m 11s (- 36m 27s) (11750 23%) 2.3011
11m 25s (- 36m 10s) (12000 24%) 2.0989
11m 38s (- 35m 52s) (12250 24%) 2.2465
11m 52s (- 35m 36s) (12500 25%) 2.2069
12m 6s (- 35m 21s) (12750 25%) 2.3457
12m 20s (- 35m 8s) (13000 26%) 2.2645
12m 37s (- 34m 59s) (13250 26%) 2.1701
12m 51s (- 34m 47s) (13500 27%) 2.1677
13m 6s (- 34m 33s) (13750 27%) 2.1980
13m 21s (- 34m 21s) (14000 28%) 2.2050
13m 37s (- 34m 10s) (14250 28%) 2.2295
13m 50s (- 33m 54s) (14500 28%) 2.1602
14m 3s (- 33m 36s) (14750 29%) 2.1159
14m 17s (- 33m 20s) (15000 30%) 2.0797
14m 31s (- 33m 6s) (15250 30%) 2.1747
14m 45s (- 32m 51s) (15500 31%) 2.1242
14m 59s (- 32m 35s) (15750 31%) 1.9999
15m 13s (- 32m 20s) (16000 32%) 2.0295
15m 26s (- 32m 5s) (16250 32%) 2.1235
15m 41s (- 31m 52s) (16500 33%) 2.1474
15m 56s (- 31m 38s) (16750 33%) 2.0056
16m 11s (- 31m 26s) (17000 34%) 2.1091
16m 26s (- 31m 13s) (17250 34%) 2.0094
16m 41s (- 31m 0s) (17500 35%) 2.0407
16m 55s (- 30m 45s) (17750 35%) 2.1027
17m 11s (- 30m 33s) (18000 36%) 2.0396
17m 25s (- 30m 19s) (18250 36%) 1.9195
17m 40s (- 30m 4s) (18500 37%) 1.8826
17m 54s (- 29m 50s) (18750 37%) 1.9585
18m 8s (- 29m 36s) (19000 38%) 1.9856
18m 21s (- 29m 19s) (19250 38%) 1.8507
18m 35s (- 29m 5s) (19500 39%) 1.9137
18m 49s (- 28m 50s) (19750 39%) 1.8913
19m 3s (- 28m 34s) (20000 40%) 1.8300
19m 17s (- 28m 20s) (20250 40%) 1.9194
19m 32s (- 28m 6s) (20500 41%) 1.8994
19m 46s (- 27m 52s) (20750 41%) 1.9269
19m 59s (- 27m 36s) (21000 42%) 1.9591
20m 12s (- 27m 21s) (21250 42%) 1.8543
20m 27s (- 27m 7s) (21500 43%) 1.9058
20m 41s (- 26m 52s) (21750 43%) 1.9672
20m 54s (- 26m 37s) (22000 44%) 1.8047
21m 9s (- 26m 23s) (22250 44%) 1.7742
21m 23s (- 26m 9s) (22500 45%) 1.8339
21m 37s (- 25m 54s) (22750 45%) 1.8410
21m 51s (- 25m 39s) (23000 46%) 1.8805
22m 4s (- 25m 24s) (23250 46%) 1.7419
22m 19s (- 25m 10s) (23500 47%) 1.7621
22m 33s (- 24m 56s) (23750 47%) 1.8105
22m 48s (- 24m 42s) (24000 48%) 1.7943
23m 4s (- 24m 29s) (24250 48%) 1.6676
23m 20s (- 24m 17s) (24500 49%) 1.8287
23m 35s (- 24m 4s) (24750 49%) 1.6563
23m 49s (- 23m 49s) (25000 50%) 1.7273
24m 4s (- 23m 36s) (25250 50%) 1.8329
24m 19s (- 23m 22s) (25500 51%) 1.7469
24m 33s (- 23m 8s) (25750 51%) 1.7384
24m 47s (- 22m 52s) (26000 52%) 1.6652
25m 0s (- 22m 37s) (26250 52%) 1.6037
25m 14s (- 22m 23s) (26500 53%) 1.7191
25m 29s (- 22m 9s) (26750 53%) 1.6973
25m 43s (- 21m 55s) (27000 54%) 1.6083
25m 58s (- 21m 40s) (27250 54%) 1.7156
26m 13s (- 21m 27s) (27500 55%) 1.7280
26m 28s (- 21m 13s) (27750 55%) 1.7159
26m 44s (- 21m 0s) (28000 56%) 1.7114
26m 59s (- 20m 46s) (28250 56%) 1.6274
27m 12s (- 20m 31s) (28500 56%) 1.7392
27m 26s (- 20m 16s) (28750 57%) 1.6450
27m 39s (- 20m 1s) (29000 57%) 1.6486
27m 55s (- 19m 48s) (29250 58%) 1.5500
28m 9s (- 19m 33s) (29500 59%) 1.5396
28m 23s (- 19m 19s) (29750 59%) 1.5874
28m 37s (- 19m 4s) (30000 60%) 1.6781
28m 50s (- 18m 50s) (30250 60%) 1.5464
29m 4s (- 18m 35s) (30500 61%) 1.5097
29m 17s (- 18m 20s) (30750 61%) 1.6336
29m 31s (- 18m 5s) (31000 62%) 1.4506
29m 44s (- 17m 50s) (31250 62%) 1.5574
29m 57s (- 17m 35s) (31500 63%) 1.4673
30m 11s (- 17m 21s) (31750 63%) 1.5363
30m 24s (- 17m 6s) (32000 64%) 1.4984
30m 39s (- 16m 52s) (32250 64%) 1.5828
30m 52s (- 16m 37s) (32500 65%) 1.5599
31m 6s (- 16m 23s) (32750 65%) 1.5457
31m 20s (- 16m 8s) (33000 66%) 1.4696
31m 36s (- 15m 55s) (33250 66%) 1.5849
31m 50s (- 15m 41s) (33500 67%) 1.4862
32m 4s (- 15m 26s) (33750 67%) 1.5384
32m 18s (- 15m 12s) (34000 68%) 1.5472
32m 31s (- 14m 57s) (34250 68%) 1.5177
32m 44s (- 14m 42s) (34500 69%) 1.4774
32m 58s (- 14m 28s) (34750 69%) 1.5311
33m 12s (- 14m 13s) (35000 70%) 1.4315
33m 26s (- 13m 59s) (35250 70%) 1.5333
33m 40s (- 13m 45s) (35500 71%) 1.5042
33m 54s (- 13m 30s) (35750 71%) 1.5169
34m 8s (- 13m 16s) (36000 72%) 1.4865
34m 21s (- 13m 2s) (36250 72%) 1.4325
34m 35s (- 12m 47s) (36500 73%) 1.4366
34m 48s (- 12m 33s) (36750 73%) 1.3897
35m 2s (- 12m 18s) (37000 74%) 1.4056
35m 15s (- 12m 4s) (37250 74%) 1.3767
35m 29s (- 11m 49s) (37500 75%) 1.2663
35m 44s (- 11m 35s) (37750 75%) 1.2585
35m 59s (- 11m 21s) (38000 76%) 1.3711
36m 14s (- 11m 7s) (38250 76%) 1.4283
36m 29s (- 10m 54s) (38500 77%) 1.2946
36m 44s (- 10m 39s) (38750 77%) 1.3490
36m 58s (- 10m 25s) (39000 78%) 1.3680
37m 13s (- 10m 11s) (39250 78%) 1.3051
37m 27s (- 9m 57s) (39500 79%) 1.3372
37m 43s (- 9m 43s) (39750 79%) 1.3481
37m 56s (- 9m 29s) (40000 80%) 1.2466
38m 11s (- 9m 15s) (40250 80%) 1.3027
38m 24s (- 9m 0s) (40500 81%) 1.3294
38m 38s (- 8m 46s) (40750 81%) 1.3335
38m 52s (- 8m 32s) (41000 82%) 1.3182
39m 6s (- 8m 17s) (41250 82%) 1.2889
39m 20s (- 8m 3s) (41500 83%) 1.2759
39m 34s (- 7m 49s) (41750 83%) 1.1475
39m 48s (- 7m 34s) (42000 84%) 1.3096
40m 2s (- 7m 20s) (42250 84%) 1.3623
40m 16s (- 7m 6s) (42500 85%) 1.1836
40m 29s (- 6m 52s) (42750 85%) 1.2626
40m 43s (- 6m 37s) (43000 86%) 1.3089
40m 57s (- 6m 23s) (43250 86%) 1.3444
41m 10s (- 6m 9s) (43500 87%) 1.1942
41m 24s (- 5m 54s) (43750 87%) 1.1610
41m 37s (- 5m 40s) (44000 88%) 1.2403
41m 51s (- 5m 26s) (44250 88%) 1.2399
42m 4s (- 5m 11s) (44500 89%) 1.1469
42m 17s (- 4m 57s) (44750 89%) 1.2939
42m 31s (- 4m 43s) (45000 90%) 1.1891
42m 45s (- 4m 29s) (45250 90%) 1.1746
43m 0s (- 4m 15s) (45500 91%) 1.2312
43m 15s (- 4m 1s) (45750 91%) 1.1844
43m 30s (- 3m 46s) (46000 92%) 1.2206
43m 45s (- 3m 32s) (46250 92%) 1.1198
44m 0s (- 3m 18s) (46500 93%) 1.1725
44m 15s (- 3m 4s) (46750 93%) 1.1767
44m 30s (- 2m 50s) (47000 94%) 1.0955
44m 45s (- 2m 36s) (47250 94%) 1.1346
45m 1s (- 2m 22s) (47500 95%) 1.2594
45m 14s (- 2m 7s) (47750 95%) 1.1463
45m 28s (- 1m 53s) (48000 96%) 1.0840
45m 41s (- 1m 39s) (48250 96%) 1.2354
45m 54s (- 1m 25s) (48500 97%) 1.1347
46m 8s (- 1m 10s) (48750 97%) 1.1060
46m 22s (- 0m 56s) (49000 98%) 1.1978
46m 36s (- 0m 42s) (49250 98%) 1.1906
46m 50s (- 0m 28s) (49500 99%) 1.0553
47m 3s (- 0m 14s) (49750 99%) 1.1334
47m 18s (- 0m 0s) (50000 100%) 1.0692
[plot: training loss curve]

In [21]:
# Hw 1.3
# First, change the activation function of DecoderLSTM to tanh

class DecoderLSTM_v2(nn.Module):
    """Decoder using an LSTM backbone, with tanh as the activation function"""
    def __init__(self, hidden_size: int, output_size: int):
        """
        Args:
            hidden_size: The number of features in the hidden state
            output_size: size of the target vocabulary
        """
        super(DecoderLSTM_v2, self).__init__()
        self.hidden_size = hidden_size
        # Learnable word embeddings: each word index from a vocabulary of size
        # output_size is mapped to a vector of dimensionality hidden_size
        self.embedding = nn.Embedding(output_size, hidden_size)
        # LSTM
        self.lstm = nn.LSTM(hidden_size, hidden_size)
        # prediction layer over the target vocabulary
        self.out = nn.Linear(hidden_size, output_size)
        # log-softmax to produce log-probabilities (paired with NLLLoss)
        self.log_softmax = nn.LogSoftmax(dim=1)
        # activation function changed from relu to tanh for Hw 1.3
        self.activation_function = torch.tanh
        
    def forward(self, inputs, state):
        """Forward
        Args:
            inputs: word index, shape [1, 1]
            state : (hidden, cell), each of shape [1, 1, hidden_size]
        Returns:
            output: [1, output_size], log-probabilities over the target vocabulary
            state: (hidden, cell)
        """
        (hidden, cell) = state
        # Retrieve word embeddings, [1, 1, hidden_size]
        output = self.embedding(inputs).view(1, 1, -1)
        # activation function, [1, 1, hidden_size]
        output = self.activation_function(output)
        # output: [1, 1, hidden_size]
        output, (hidden, cell) = self.lstm(output, (hidden, cell))
        # output: [1, output_size]
        output = self.log_softmax(self.out(output[0]))
        return output, (hidden, cell)

    def init_hidden(self):
        """Init hidden
        Returns:
            hidden:
            cell:
        """
        cell = torch.zeros(1, 1, self.hidden_size, device=device)
        hidden = torch.zeros(1, 1, self.hidden_size, device=device)
        return hidden, cell

In [22]:
setup_seed(45)
hidden_size = 256
# Reverse the order of source input sentence
reverse_source_sentence = True
# Feed the target as the next input
use_teacher_forcing = True
encoder = EncoderLSTM(input_lang.n_words, hidden_size).to(device)
decoder = DecoderLSTM_v2(hidden_size, output_lang.n_words).to(device)
print(">> Model is on: {}".format(next(encoder.parameters()).is_cuda))
print(">> Model is on: {}".format(next(decoder.parameters()).is_cuda))

iters = 50000
train(encoder, decoder, iters, reverse_source_sentence=reverse_source_sentence, 
      use_teacher_forcing=use_teacher_forcing, print_every=250, plot_every=250)


>> Model is on: True
>> Model is on: True
0m 15s (- 52m 49s) (250 0%) 5.4118
0m 29s (- 48m 55s) (500 1%) 3.4440
0m 42s (- 46m 57s) (750 1%) 3.1115
0m 55s (- 45m 34s) (1000 2%) 2.9068
1m 9s (- 45m 4s) (1250 2%) 2.8706
1m 22s (- 44m 33s) (1500 3%) 2.7711
1m 35s (- 43m 56s) (1750 3%) 2.7297
1m 48s (- 43m 27s) (2000 4%) 2.7282
2m 1s (- 42m 59s) (2250 4%) 2.6766
2m 15s (- 42m 47s) (2500 5%) 2.6640
2m 28s (- 42m 36s) (2750 5%) 2.6400
2m 42s (- 42m 20s) (3000 6%) 2.5584
2m 55s (- 42m 5s) (3250 6%) 2.5408
3m 9s (- 41m 56s) (3500 7%) 2.5797
3m 23s (- 41m 44s) (3750 7%) 2.5578
3m 36s (- 41m 32s) (4000 8%) 2.4568
3m 50s (- 41m 18s) (4250 8%) 2.4575
4m 3s (- 41m 1s) (4500 9%) 2.3789
4m 17s (- 40m 48s) (4750 9%) 2.3421
4m 31s (- 40m 45s) (5000 10%) 2.3405
4m 46s (- 40m 39s) (5250 10%) 2.3912
5m 0s (- 40m 28s) (5500 11%) 2.3813
5m 15s (- 40m 25s) (5750 11%) 2.3159
5m 29s (- 40m 16s) (6000 12%) 2.3369
5m 44s (- 40m 8s) (6250 12%) 2.2909
5m 57s (- 39m 55s) (6500 13%) 2.2288
6m 11s (- 39m 40s) (6750 13%) 2.1220
6m 25s (- 39m 25s) (7000 14%) 2.2243
6m 39s (- 39m 16s) (7250 14%) 2.2551
6m 53s (- 39m 2s) (7500 15%) 2.1419
7m 6s (- 38m 47s) (7750 15%) 2.1616
7m 20s (- 38m 32s) (8000 16%) 2.1254
7m 34s (- 38m 19s) (8250 16%) 2.0929
7m 48s (- 38m 6s) (8500 17%) 2.1534
8m 2s (- 37m 53s) (8750 17%) 2.0851
8m 16s (- 37m 39s) (9000 18%) 2.0738
8m 29s (- 37m 24s) (9250 18%) 1.9404
8m 44s (- 37m 14s) (9500 19%) 2.0076
8m 57s (- 36m 57s) (9750 19%) 1.9080
9m 10s (- 36m 40s) (10000 20%) 2.0130
9m 22s (- 36m 21s) (10250 20%) 1.9649
9m 35s (- 36m 4s) (10500 21%) 1.8951
9m 49s (- 35m 50s) (10750 21%) 1.9457
10m 2s (- 35m 37s) (11000 22%) 1.9487
10m 16s (- 35m 23s) (11250 22%) 1.8837
10m 30s (- 35m 10s) (11500 23%) 1.9643
10m 44s (- 34m 58s) (11750 23%) 1.8865
10m 58s (- 34m 44s) (12000 24%) 1.7592
11m 11s (- 34m 30s) (12250 24%) 1.8790
11m 25s (- 34m 16s) (12500 25%) 1.8230
11m 39s (- 34m 2s) (12750 25%) 1.9465
11m 52s (- 33m 48s) (13000 26%) 1.8486
12m 6s (- 33m 34s) (13250 26%) 1.8044
12m 20s (- 33m 21s) (13500 27%) 1.7861
12m 33s (- 33m 6s) (13750 27%) 1.8498
12m 47s (- 32m 52s) (14000 28%) 1.7992
13m 0s (- 32m 38s) (14250 28%) 1.8050
13m 14s (- 32m 24s) (14500 28%) 1.7724
13m 28s (- 32m 11s) (14750 29%) 1.7104
13m 43s (- 32m 1s) (15000 30%) 1.6796
13m 59s (- 31m 53s) (15250 30%) 1.7817
14m 13s (- 31m 40s) (15500 31%) 1.7459
14m 27s (- 31m 26s) (15750 31%) 1.6660
14m 41s (- 31m 14s) (16000 32%) 1.6038
14m 56s (- 31m 1s) (16250 32%) 1.7209
15m 10s (- 30m 47s) (16500 33%) 1.7453
15m 24s (- 30m 34s) (16750 33%) 1.6432
15m 37s (- 30m 20s) (17000 34%) 1.6812
15m 52s (- 30m 7s) (17250 34%) 1.6068
16m 6s (- 29m 54s) (17500 35%) 1.6077
16m 21s (- 29m 43s) (17750 35%) 1.6540
16m 35s (- 29m 30s) (18000 36%) 1.6522
16m 50s (- 29m 17s) (18250 36%) 1.5541
17m 4s (- 29m 5s) (18500 37%) 1.4592
17m 19s (- 28m 52s) (18750 37%) 1.5580
17m 34s (- 28m 40s) (19000 38%) 1.5883
17m 48s (- 28m 27s) (19250 38%) 1.4840
18m 3s (- 28m 14s) (19500 39%) 1.5447
18m 17s (- 28m 1s) (19750 39%) 1.5163
18m 32s (- 27m 48s) (20000 40%) 1.4560
18m 47s (- 27m 37s) (20250 40%) 1.5119
19m 3s (- 27m 26s) (20500 41%) 1.5253
19m 19s (- 27m 14s) (20750 41%) 1.5125
19m 34s (- 27m 1s) (21000 42%) 1.5314
19m 49s (- 26m 49s) (21250 42%) 1.4621
20m 3s (- 26m 35s) (21500 43%) 1.5241
20m 17s (- 26m 21s) (21750 43%) 1.5766
20m 30s (- 26m 5s) (22000 44%) 1.4428
20m 44s (- 25m 51s) (22250 44%) 1.3960
20m 57s (- 25m 37s) (22500 45%) 1.4166
21m 11s (- 25m 23s) (22750 45%) 1.4751
21m 26s (- 25m 10s) (23000 46%) 1.4560
21m 40s (- 24m 56s) (23250 46%) 1.3324
21m 54s (- 24m 42s) (23500 47%) 1.3579
22m 9s (- 24m 29s) (23750 47%) 1.4121
22m 23s (- 24m 15s) (24000 48%) 1.4008
22m 37s (- 24m 1s) (24250 48%) 1.3320
22m 51s (- 23m 47s) (24500 49%) 1.4318
23m 6s (- 23m 34s) (24750 49%) 1.2667
23m 21s (- 23m 21s) (25000 50%) 1.3110
23m 36s (- 23m 8s) (25250 50%) 1.4134
23m 51s (- 22m 55s) (25500 51%) 1.3328
24m 7s (- 22m 42s) (25750 51%) 1.3002
24m 22s (- 22m 30s) (26000 52%) 1.2974
24m 37s (- 22m 16s) (26250 52%) 1.2296
24m 52s (- 22m 3s) (26500 53%) 1.3079
25m 6s (- 21m 49s) (26750 53%) 1.2736
25m 20s (- 21m 35s) (27000 54%) 1.1862
25m 36s (- 21m 22s) (27250 54%) 1.2917
25m 50s (- 21m 8s) (27500 55%) 1.3090
26m 6s (- 20m 55s) (27750 55%) 1.2842
26m 21s (- 20m 42s) (28000 56%) 1.3294
26m 37s (- 20m 30s) (28250 56%) 1.2652
26m 51s (- 20m 15s) (28500 56%) 1.3036
27m 6s (- 20m 2s) (28750 57%) 1.2516
27m 19s (- 19m 47s) (29000 57%) 1.2109
27m 33s (- 19m 33s) (29250 58%) 1.1726
27m 48s (- 19m 19s) (29500 59%) 1.1645
28m 3s (- 19m 5s) (29750 59%) 1.1421
28m 17s (- 18m 51s) (30000 60%) 1.2620
28m 32s (- 18m 37s) (30250 60%) 1.1545
28m 47s (- 18m 24s) (30500 61%) 1.1347
29m 1s (- 18m 10s) (30750 61%) 1.2216
29m 16s (- 17m 56s) (31000 62%) 1.1070
29m 32s (- 17m 43s) (31250 62%) 1.1662
29m 47s (- 17m 30s) (31500 63%) 1.1051
30m 3s (- 17m 16s) (31750 63%) 1.1288
30m 17s (- 17m 2s) (32000 64%) 1.0735
30m 31s (- 16m 48s) (32250 64%) 1.1689
30m 46s (- 16m 34s) (32500 65%) 1.1648
30m 59s (- 16m 19s) (32750 65%) 1.1337
31m 13s (- 16m 5s) (33000 66%) 1.0650
31m 27s (- 15m 50s) (33250 66%) 1.2024
31m 41s (- 15m 36s) (33500 67%) 1.0940
31m 55s (- 15m 22s) (33750 67%) 1.1060
32m 10s (- 15m 8s) (34000 68%) 1.1126
32m 25s (- 14m 54s) (34250 68%) 1.1412
32m 40s (- 14m 40s) (34500 69%) 1.0815
32m 54s (- 14m 26s) (34750 69%) 1.0864
33m 9s (- 14m 12s) (35000 70%) 1.0517
33m 24s (- 13m 58s) (35250 70%) 1.0901
33m 38s (- 13m 44s) (35500 71%) 1.0965
33m 52s (- 13m 30s) (35750 71%) 1.1462
34m 7s (- 13m 16s) (36000 72%) 1.0654
34m 21s (- 13m 1s) (36250 72%) 0.9683
34m 35s (- 12m 47s) (36500 73%) 1.0852
34m 49s (- 12m 33s) (36750 73%) 1.0260
35m 3s (- 12m 18s) (37000 74%) 1.0323
35m 17s (- 12m 4s) (37250 74%) 1.0400
35m 31s (- 11m 50s) (37500 75%) 0.9172
35m 45s (- 11m 36s) (37750 75%) 0.9479
36m 0s (- 11m 22s) (38000 76%) 0.9919
36m 14s (- 11m 7s) (38250 76%) 1.0249
36m 28s (- 10m 53s) (38500 77%) 0.9339
36m 42s (- 10m 39s) (38750 77%) 0.9684
36m 56s (- 10m 25s) (39000 78%) 0.9866
37m 10s (- 10m 10s) (39250 78%) 0.9021
37m 25s (- 9m 56s) (39500 79%) 0.9421
37m 39s (- 9m 42s) (39750 79%) 0.9982
37m 53s (- 9m 28s) (40000 80%) 0.8918
38m 7s (- 9m 14s) (40250 80%) 0.9356
38m 22s (- 8m 59s) (40500 81%) 0.9880
38m 36s (- 8m 45s) (40750 81%) 1.0126
38m 51s (- 8m 31s) (41000 82%) 0.8997
39m 5s (- 8m 17s) (41250 82%) 0.9651
39m 19s (- 8m 3s) (41500 83%) 0.9271
39m 33s (- 7m 49s) (41750 83%) 0.8294
39m 47s (- 7m 34s) (42000 84%) 0.8939
40m 2s (- 7m 20s) (42250 84%) 0.9690
40m 16s (- 7m 6s) (42500 85%) 0.8294
40m 30s (- 6m 52s) (42750 85%) 0.9057
40m 44s (- 6m 37s) (43000 86%) 0.9040
40m 59s (- 6m 23s) (43250 86%) 0.9020
41m 13s (- 6m 9s) (43500 87%) 0.8598
41m 28s (- 5m 55s) (43750 87%) 0.7808
41m 43s (- 5m 41s) (44000 88%) 0.8651
41m 58s (- 5m 27s) (44250 88%) 0.8298
42m 13s (- 5m 13s) (44500 89%) 0.8292
42m 30s (- 4m 59s) (44750 89%) 0.9227
42m 44s (- 4m 44s) (45000 90%) 0.8551
43m 2s (- 4m 31s) (45250 90%) 0.7870
43m 21s (- 4m 17s) (45500 91%) 0.8815
43m 38s (- 4m 3s) (45750 91%) 0.8191
43m 57s (- 3m 49s) (46000 92%) 0.8107
44m 14s (- 3m 35s) (46250 92%) 0.7644
44m 29s (- 3m 20s) (46500 93%) 0.7996
44m 45s (- 3m 6s) (46750 93%) 0.7954
45m 2s (- 2m 52s) (47000 94%) 0.7589
45m 19s (- 2m 38s) (47250 94%) 0.7825
45m 36s (- 2m 24s) (47500 95%) 0.8812
45m 52s (- 2m 9s) (47750 95%) 0.8011
46m 8s (- 1m 55s) (48000 96%) 0.7464
46m 24s (- 1m 41s) (48250 96%) 0.8238
46m 42s (- 1m 26s) (48500 97%) 0.7181
46m 57s (- 1m 12s) (48750 97%) 0.7389
47m 11s (- 0m 57s) (49000 98%) 0.8540
47m 26s (- 0m 43s) (49250 98%) 0.8352
47m 39s (- 0m 28s) (49500 99%) 0.7494
47m 53s (- 0m 14s) (49750 99%) 0.8223
48m 7s (- 0m 0s) (50000 100%) 0.7700
[plot: training loss curve]

1.2 Decoder with Attention

Why do we need an attention mechanism?
The short version: with attention, a seq2seq model achieves better performance and takes less time to train.
The long version: attention allows the decoder network to "focus" on a different part of the encoder's outputs at every step of the decoder's own outputs. For simplicity, we change DecoderLSTM into AttentionDecoderLSTM, add some helper functions, and then we can train the model.

Details of AttentionDecoderLSTM
Since there are many ways to do attention, we select a simple one.
First we calculate a set of attention weights.
These are multiplied by the encoder output vectors to create a weighted combination. The result (called attention_applied in the code) contains information about that specific part of the input sequence, and thus helps the decoder choose the right output words.
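Concretely, writing $e$ for the embedded decoder input, $h$ for the previous decoder hidden state, and $E$ for the [max_length, hidden_size] matrix of encoder outputs, the variant implemented below computes (this is just a reading of the code that follows; many other attention formulations exist):

$$a = \mathrm{softmax}(W_1\,[e \,;\, h])$$
$$\text{attention\_applied} = a\,E$$
$$x = \mathrm{ReLU}(W_2\,[e \,;\, \text{attention\_applied}])$$

where $[\cdot\,;\,\cdot]$ denotes concatenation, and $x$ then enters the LSTM in place of the plain embedding.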


In [12]:
class AttentionDecoderLSTM(nn.Module):
    def __init__(self, hidden_size: int, output_size: int, dropout_p=0.1, max_length=MAX_LENGTH):
        """DecoderLSTM with attention mechanism
        """
        super(AttentionDecoderLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length
        # Learnable word embeddings: each word index from a vocabulary of size
        # output_size is mapped to a vector of dimensionality hidden_size
        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        # W1: maps [embedded ; hidden] to attention weights over encoder positions
        self.attention = nn.Linear(self.hidden_size * 2, self.max_length)
        # W2: combines [embedded ; attention_applied] back to hidden_size
        self.attention_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.lstm = nn.LSTM(self.hidden_size, self.hidden_size)
        # prediction layer 
        self.out = nn.Linear(self.hidden_size, self.output_size)
        # activation
        self.activation_fn = F.relu
        
    def forward(self, inputs, state, encoder_outputs):
        """Forward
        Args:
            inputs: word index, shape [1, 1]
            state : (hidden, cell), each of shape [1, 1, hidden_size]
            encoder_outputs: [max_length, hidden_size]
        Returns:
            output: [1, output_size], log-probabilities over the target vocabulary
            state: (hidden, cell)
            attention_weights: [1, max_length]
        """
        # embedded: [1, 1, hidden_size]
        embedded = self.embedding(inputs).view(1, 1, -1)
        embedded = self.dropout(embedded)
        (hidden, cell) = state
       
        # attention_weights: [1, max_length]
        attention_weights = F.softmax(
            self.attention(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        # attention_applied: [1, 1, hidden_size]
        # torch.bmm is batched matrix multiplication (the @ operator)
        attention_applied = torch.bmm(attention_weights.unsqueeze(0),
                                      encoder_outputs.unsqueeze(0))
        
        # output: [1, hidden_size * 2]
        output = torch.cat((embedded[0], attention_applied[0]), 1)
        # output: [1, 1, hidden_size]
        output = self.attention_combine(output).unsqueeze(0)
        
        output = self.activation_fn(output)
        # output: [1, 1, hidden_size]
        output, (hidden, cell) = self.lstm(output, (hidden, cell))
        
        # output, [1, output_size]
        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, (hidden, cell), attention_weights
    
    def init_hidden(self):
        """Init hidden
        Returns:
            hidden:
            cell:
        """
        cell = torch.zeros(1, 1, self.hidden_size, device=device)
        hidden = torch.zeros(1, 1, self.hidden_size, device=device)
        return hidden, cell
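As before, a minimal shape check (toy vocabulary of 10 words and hidden size 8; MAX_LENGTH and SOS_token come from utils):

In [ ]:
toy_attn_decoder = AttentionDecoderLSTM(hidden_size=8, output_size=10).to(device)
attn_state = toy_attn_decoder.init_hidden()
enc_outs = torch.zeros(MAX_LENGTH, 8, device=device)  # stand-in encoder outputs
out, attn_state, attn_weights = toy_attn_decoder(
    torch.tensor([[SOS_token]], device=device), attn_state, enc_outs)
print(out.shape, attn_weights.shape)                  # [1, 10] and [1, MAX_LENGTH]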

In [13]:
def train_by_sentence_attn(input_tensor, target_tensor, encoder, decoder, 
                      encoder_optimizer, decoder_optimizer, loss_fn, 
                      use_teacher_forcing=True, reverse_source_sentence=True,
                      max_length=MAX_LENGTH):
    """Train by single sentence using EncoderLSTM and DecoderLSTM
       including training and update model, combining attention mechanism.
    Args:
        input_tensor: [input_sequence_len, 1, hidden_size]
        target_tensor: [target_sequence_len, 1, hidden_size]
        encoder: EncoderLSTM
        decoder: DecoderLSTM
        encoder_optimizer: optimizer for encoder
        decoder_optimizer: optimizer for decoder
        loss_fn: loss function
        use_teacher_forcing: True is to Feed the target as the next input, 
                             False is to use its own predictions as the next input
        max_length: max length for input and output
    Returns:
        loss: scalar
    """
    if reverse_source_sentence:
        input_tensor = torch.flip(input_tensor, [0])
        
    hidden, cell = encoder.init_hidden()

    # Clear the gradients of all optimized tensors
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # Get sequence length of the input and target sentences.
    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)
    
    # encoder outputs:  [max_length, hidden_size]
    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    # Get encoder outputs
    for ei in range(input_length):
        encoder_output, (hidden, cell) = encoder(
            input_tensor[ei], (hidden, cell))
        encoder_outputs[ei] = encoder_output[0, 0]
    
    # First input for the decoder
    decoder_input = torch.tensor([[SOS_token]], device=device)
    
    # The last state of the encoder (hidden, cell) serves directly
    # as the initial state of the decoder

    for di in range(target_length):
        # !! The key change: pass encoder_outputs so the decoder can apply attention
        decoder_output, (hidden, cell), _ = decoder(
            decoder_input, (hidden, cell), encoder_outputs)
        
        if use_teacher_forcing:
            # Feed the target as the next input
            loss += loss_fn(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # Teacher forcing
        else:
            # Use its own predictions as the next input
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()
            loss += loss_fn(decoder_output, target_tensor[di])

        # Stop when the decoder produces the End-of-Sentence (EOS) token
        if decoder_input.item() == EOS_token:
            break

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length
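
A side note on the .detach() above: the index returned by topk is reused as the next input token. Integer tensors carry no gradient anyway, but detaching makes it explicit that the prediction is treated as plain data, cut off from the autograd graph. A tiny illustration with made-up logits:

logits = torch.randn(1, 5)            # stand-in decoder output over a 5-word vocab
topv, topi = logits.topk(1)           # best log-probability and its index
next_input = topi.squeeze().detach()  # a scalar LongTensor with no graph history
print(next_input.item())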

In [14]:
def train_attn(encoder, decoder, n_iters, reverse_source_sentence=True, 
          use_teacher_forcing=True,
          print_every=1000, plot_every=100, 
          learning_rate=0.01):
    """Train of Seq2seq with attention 
    Args:
        encoder: EncoderLSTM
        decoder: DecoderLSTM
        n_iters: train with n_iters sentences without replacement
        reverse_source_sentence: True is to reverse the source sentence 
                                 but keep order of target unchanged,
                                 False is to keep order of the source sentence 
                                 target unchanged
        use_teacher_forcing: True is to Feed the target as the next input, 
                             False is to use its own predictions as the next input
        print_every: print log every print_every 
        plot_every: plot every plot_every 
        learning_rate: 
        
    """
    
    start = time.time()
    
    plot_losses = []
    print_loss_total = 0
    plot_loss_total = 0

    # Use SGD to optimize encoder and decoder parameters
    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    
    # Obtain training input 
    training_pairs = [tensor_from_pair(random.choice(pairs), input_lang, output_lang)
                      for _ in range(n_iters)]
    
    # Negative log likelihood loss
    loss_fn = nn.NLLLoss()

    for i in range(1, n_iters+1):
        # Get a pair of sentences and move them to device, 
        # training_pair: ([Seq_size, 1, input_size], [Seq_size, 1, input_size])
        training_pair = training_pairs[i-1]
        input_tensor = training_pair[0].to(device)
        target_tensor = training_pair[1].to(device)            
            
        # Train by a pair of source sentence and target sentence
        loss = train_by_sentence_attn(input_tensor, target_tensor, 
                                      encoder, decoder,
                                      encoder_optimizer, decoder_optimizer, 
                                      loss_fn, use_teacher_forcing=use_teacher_forcing,
                                      reverse_source_sentence=reverse_source_sentence)
        
        print_loss_total += loss
        plot_loss_total += loss

        if i % print_every == 0:
            # Print Loss
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print("%s (%d %d%%) %.4f" % (time_since(start, i / n_iters),
                                         i, i / n_iters * 100, print_loss_avg))

        if i % plot_every == 0:
            # Plot
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0
    
    # show plot
    show_plot(plot_losses)

In [15]:
def evaluate_by_sentence_attn(encoder, decoder, sentence, 
                              reverse_source_sentence=True, max_length=MAX_LENGTH):
    """Evalutae on a source sentence with model trained with attention mechanism
    Args:
        encoder
        decoder
        sentence
        max_length
    Return:
        decoded_words: predicted sentence
    """
    with torch.no_grad():
        # Get tensor of sentence
        input_tensor = tensor_from_sentence(input_lang, sentence).to(device)
        input_length = input_tensor.size(0)
        
        if reverse_source_sentence:
            input_tensor = torch.flip(input_tensor, [0])
        
        # init state for encoder
        (hidden, cell) = encoder.init_hidden()

        # encoder outputs: [max_length, hidden_size]
        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            encoder_output, (hidden, cell) = encoder(input_tensor[ei],
                                                     (hidden, cell))
            encoder_outputs[ei] += encoder_output[0, 0]
            
        # SOS is the first decoder input; the encoder's last state (hidden, cell)
        # serves directly as the decoder's initial state
        decoder_input = torch.tensor([[SOS_token]], device=device)
        decoded_words = []
        
        # CHANGE!! Add decoder_attentions to collect attention map
        decoder_attentions = torch.zeros(max_length, max_length)

        # When evaluate, use its own predictions as the next input
        for di in range(max_length):
            # CHANGE!! Attention
            decoder_output, (hidden, cell), decoder_attention = \
                decoder(decoder_input, (hidden, cell), encoder_outputs)
            topv, topi = decoder_output.data.topk(1)
            # CHANGE!!
            decoder_attentions[di] = decoder_attention.data
            if topi.item() == EOS_token:
                decoded_words.append("<EOS>")
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])
                
            decoder_input = topi.squeeze().detach()

    return decoded_words, decoder_attentions[:di + 1]

In [16]:
def show_attention(input_sentence, output_words, attentions):
    """Show attention between input sentence and output words
    """
    # Set up figure with colorbar
    fig = plt.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(attentions.numpy(), cmap='bone')
    fig.colorbar(cax)

    # Set up axes
    ax.set_xticklabels([''] + input_sentence.split(' ') +
                       ['<EOS>'], rotation=90)
    ax.set_yticklabels([''] + output_words)

    # Show label at every tick
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    plt.show()

In [17]:
def evaluate_and_show_attention(input_sentence, encoder, decoder):
    """Evaluate and show attention for a input sentence
    """
    output_words, attentions = evaluate_by_sentence_attn(
        encoder, decoder, input_sentence)
    print('input =', input_sentence)
    print('output =', ' '.join(output_words))
    show_attention(input_sentence, output_words, attentions)

In [20]:
setup_seed(45)
hidden_size = 256
# Reverse the order of source input sentence
reverse_source_sentence = True
# Feed the target as the next input
use_teacher_forcing = True
encoder = EncoderLSTM(input_lang.n_words, hidden_size).to(device)
decoder = AttentionDecoderLSTM(hidden_size, output_lang.n_words).to(device)
print(">> Model is on: {}".format(next(encoder.parameters()).is_cuda))
print(">> Model is on: {}".format(next(decoder.parameters()).is_cuda))


>> Model is on: True
>> Model is on: True

In [21]:
iters = 50000
train_attn(encoder, decoder, iters, reverse_source_sentence=reverse_source_sentence, 
           use_teacher_forcing=use_teacher_forcing, print_every=250, plot_every=250)


0m 23s (- 77m 56s) (250 0%) 4.9045
0m 41s (- 68m 39s) (500 1%) 3.3049
1m 0s (- 65m 48s) (750 1%) 3.0303
1m 17s (- 63m 40s) (1000 2%) 2.8569
1m 36s (- 62m 31s) (1250 2%) 2.8321
1m 54s (- 61m 38s) (1500 3%) 2.7320
2m 11s (- 60m 23s) (1750 3%) 2.6652
2m 28s (- 59m 34s) (2000 4%) 2.6601
2m 46s (- 58m 53s) (2250 4%) 2.6078
3m 10s (- 60m 10s) (2500 5%) 2.5934
3m 30s (- 60m 12s) (2750 5%) 2.5855
3m 49s (- 60m 2s) (3000 6%) 2.4951
4m 12s (- 60m 35s) (3250 6%) 2.4717
4m 32s (- 60m 23s) (3500 7%) 2.4963
4m 54s (- 60m 38s) (3750 7%) 2.4969
5m 15s (- 60m 30s) (4000 8%) 2.3994
5m 35s (- 60m 11s) (4250 8%) 2.3991
5m 55s (- 59m 56s) (4500 9%) 2.3072
6m 17s (- 59m 53s) (4750 9%) 2.2703
6m 38s (- 59m 48s) (5000 10%) 2.2681
6m 56s (- 59m 8s) (5250 10%) 2.3253
7m 15s (- 58m 47s) (5500 11%) 2.3062
7m 35s (- 58m 24s) (5750 11%) 2.2431
7m 54s (- 58m 2s) (6000 12%) 2.2481
8m 14s (- 57m 38s) (6250 12%) 2.2147
8m 34s (- 57m 20s) (6500 13%) 2.1789
8m 52s (- 56m 54s) (6750 13%) 2.0689
9m 12s (- 56m 31s) (7000 14%) 2.1624
9m 32s (- 56m 15s) (7250 14%) 2.1783
9m 51s (- 55m 51s) (7500 15%) 2.0618
10m 12s (- 55m 38s) (7750 15%) 2.0752
10m 31s (- 55m 14s) (8000 16%) 2.0351
10m 49s (- 54m 45s) (8250 16%) 2.0126
11m 8s (- 54m 21s) (8500 17%) 2.0556
11m 27s (- 53m 58s) (8750 17%) 2.0194
11m 45s (- 53m 35s) (9000 18%) 2.0155
12m 3s (- 53m 8s) (9250 18%) 1.8749
12m 21s (- 52m 42s) (9500 19%) 1.9430
12m 38s (- 52m 12s) (9750 19%) 1.8124
12m 56s (- 51m 45s) (10000 20%) 1.9113
13m 14s (- 51m 19s) (10250 20%) 1.8959
13m 31s (- 50m 52s) (10500 21%) 1.8226
13m 49s (- 50m 26s) (10750 21%) 1.8846
14m 7s (- 50m 3s) (11000 22%) 1.8598
14m 25s (- 49m 41s) (11250 22%) 1.8070
14m 45s (- 49m 24s) (11500 23%) 1.8770
15m 3s (- 49m 2s) (11750 23%) 1.7991
15m 22s (- 48m 40s) (12000 24%) 1.6979
15m 40s (- 48m 19s) (12250 24%) 1.7849
15m 58s (- 47m 56s) (12500 25%) 1.7383
16m 18s (- 47m 38s) (12750 25%) 1.8461
16m 39s (- 47m 23s) (13000 26%) 1.7735
16m 58s (- 47m 4s) (13250 26%) 1.7250
17m 16s (- 46m 42s) (13500 27%) 1.7031
17m 34s (- 46m 19s) (13750 27%) 1.7557
17m 52s (- 45m 58s) (14000 28%) 1.7034
18m 12s (- 45m 41s) (14250 28%) 1.7474
18m 32s (- 45m 23s) (14500 28%) 1.7002
18m 52s (- 45m 6s) (14750 29%) 1.6098
19m 12s (- 44m 49s) (15000 30%) 1.6132
19m 31s (- 44m 30s) (15250 30%) 1.7066
19m 51s (- 44m 11s) (15500 31%) 1.6781
20m 9s (- 43m 51s) (15750 31%) 1.5791
20m 28s (- 43m 31s) (16000 32%) 1.5366
20m 47s (- 43m 10s) (16250 32%) 1.6449
21m 8s (- 42m 56s) (16500 33%) 1.6655
21m 27s (- 42m 36s) (16750 33%) 1.5604
21m 48s (- 42m 20s) (17000 34%) 1.5838
22m 8s (- 42m 1s) (17250 34%) 1.5468
22m 26s (- 41m 39s) (17500 35%) 1.5449
22m 44s (- 41m 18s) (17750 35%) 1.5705
23m 3s (- 41m 0s) (18000 36%) 1.5782
23m 21s (- 40m 38s) (18250 36%) 1.4957
23m 39s (- 40m 16s) (18500 37%) 1.3964
23m 56s (- 39m 54s) (18750 37%) 1.4830
24m 15s (- 39m 34s) (19000 38%) 1.5285
24m 32s (- 39m 12s) (19250 38%) 1.4140
24m 51s (- 38m 52s) (19500 39%) 1.4499
25m 8s (- 38m 30s) (19750 39%) 1.4234
25m 26s (- 38m 10s) (20000 40%) 1.4060
25m 44s (- 37m 49s) (20250 40%) 1.4435
26m 3s (- 37m 29s) (20500 41%) 1.4514
26m 20s (- 37m 7s) (20750 41%) 1.4453
26m 37s (- 36m 46s) (21000 42%) 1.4741
26m 56s (- 36m 26s) (21250 42%) 1.3784
27m 15s (- 36m 7s) (21500 43%) 1.4439
27m 32s (- 35m 46s) (21750 43%) 1.4836
27m 49s (- 35m 25s) (22000 44%) 1.3703
28m 7s (- 35m 4s) (22250 44%) 1.3226
28m 25s (- 34m 43s) (22500 45%) 1.3610
28m 42s (- 34m 23s) (22750 45%) 1.4003
29m 1s (- 34m 3s) (23000 46%) 1.3914
29m 18s (- 33m 43s) (23250 46%) 1.2699
29m 36s (- 33m 23s) (23500 47%) 1.2957
29m 54s (- 33m 3s) (23750 47%) 1.3403
30m 11s (- 32m 42s) (24000 48%) 1.3439
30m 29s (- 32m 22s) (24250 48%) 1.2482
30m 46s (- 32m 2s) (24500 49%) 1.3789
31m 4s (- 31m 41s) (24750 49%) 1.1900
31m 22s (- 31m 22s) (25000 50%) 1.2474
31m 40s (- 31m 2s) (25250 50%) 1.3320
31m 58s (- 30m 43s) (25500 51%) 1.2478
32m 17s (- 30m 24s) (25750 51%) 1.2392
32m 34s (- 30m 4s) (26000 52%) 1.2369
32m 52s (- 29m 44s) (26250 52%) 1.1629
33m 11s (- 29m 26s) (26500 53%) 1.2625
33m 30s (- 29m 7s) (26750 53%) 1.2236
33m 49s (- 28m 48s) (27000 54%) 1.1323
34m 8s (- 28m 29s) (27250 54%) 1.2009
34m 25s (- 28m 10s) (27500 55%) 1.2412
34m 43s (- 27m 50s) (27750 55%) 1.2053
35m 0s (- 27m 30s) (28000 56%) 1.2504
35m 18s (- 27m 10s) (28250 56%) 1.1889
35m 36s (- 26m 51s) (28500 56%) 1.2637
35m 53s (- 26m 32s) (28750 57%) 1.2014
36m 12s (- 26m 12s) (29000 57%) 1.1773
36m 30s (- 25m 54s) (29250 58%) 1.1245
36m 48s (- 25m 34s) (29500 59%) 1.1128
37m 7s (- 25m 16s) (29750 59%) 1.1001
37m 25s (- 24m 56s) (30000 60%) 1.2020
37m 43s (- 24m 37s) (30250 60%) 1.0931
38m 0s (- 24m 18s) (30500 61%) 1.0847
38m 18s (- 23m 58s) (30750 61%) 1.1683
38m 35s (- 23m 39s) (31000 62%) 1.0578
38m 53s (- 23m 20s) (31250 62%) 1.1204
39m 11s (- 23m 0s) (31500 63%) 1.0375
39m 29s (- 22m 42s) (31750 63%) 1.0673
39m 47s (- 22m 23s) (32000 64%) 1.0291
40m 6s (- 22m 4s) (32250 64%) 1.1162
40m 24s (- 21m 45s) (32500 65%) 1.1235
40m 41s (- 21m 26s) (32750 65%) 1.0975
40m 59s (- 21m 6s) (33000 66%) 1.0241
41m 17s (- 20m 48s) (33250 66%) 1.1561
41m 36s (- 20m 29s) (33500 67%) 1.0525
41m 55s (- 20m 11s) (33750 67%) 1.0605
42m 13s (- 19m 52s) (34000 68%) 1.0551
42m 31s (- 19m 33s) (34250 68%) 1.0728
42m 49s (- 19m 14s) (34500 69%) 1.0422
43m 8s (- 18m 56s) (34750 69%) 1.0189
43m 26s (- 18m 37s) (35000 70%) 0.9925
43m 44s (- 18m 18s) (35250 70%) 1.0572
44m 2s (- 17m 59s) (35500 71%) 1.0211
44m 19s (- 17m 40s) (35750 71%) 1.0857
44m 37s (- 17m 21s) (36000 72%) 1.0427
44m 55s (- 17m 2s) (36250 72%) 0.9366
45m 12s (- 16m 43s) (36500 73%) 1.0282
45m 30s (- 16m 24s) (36750 73%) 0.9766
45m 48s (- 16m 5s) (37000 74%) 0.9918
46m 6s (- 15m 46s) (37250 74%) 0.9713
46m 24s (- 15m 28s) (37500 75%) 0.8682
46m 42s (- 15m 9s) (37750 75%) 0.8866
47m 0s (- 14m 50s) (38000 76%) 0.9383
47m 18s (- 14m 31s) (38250 76%) 0.9673
47m 36s (- 14m 13s) (38500 77%) 0.8919
47m 54s (- 13m 54s) (38750 77%) 0.9282
48m 12s (- 13m 35s) (39000 78%) 0.9390
48m 30s (- 13m 17s) (39250 78%) 0.8759
48m 48s (- 12m 58s) (39500 79%) 0.9049
49m 6s (- 12m 39s) (39750 79%) 0.9363
49m 24s (- 12m 21s) (40000 80%) 0.8546
49m 41s (- 12m 2s) (40250 80%) 0.8826
49m 59s (- 11m 43s) (40500 81%) 0.9364
50m 17s (- 11m 24s) (40750 81%) 0.9326
50m 35s (- 11m 6s) (41000 82%) 0.8726
50m 52s (- 10m 47s) (41250 82%) 0.9256
51m 10s (- 10m 28s) (41500 83%) 0.8858
51m 29s (- 10m 10s) (41750 83%) 0.7803
51m 47s (- 9m 51s) (42000 84%) 0.8532
52m 6s (- 9m 33s) (42250 84%) 0.9056
52m 24s (- 9m 14s) (42500 85%) 0.7939
52m 41s (- 8m 56s) (42750 85%) 0.8685
52m 59s (- 8m 37s) (43000 86%) 0.8675
53m 17s (- 8m 19s) (43250 86%) 0.8868
53m 34s (- 8m 0s) (43500 87%) 0.8165
53m 52s (- 7m 41s) (43750 87%) 0.7273
54m 10s (- 7m 23s) (44000 88%) 0.8150
54m 28s (- 7m 4s) (44250 88%) 0.8015
54m 45s (- 6m 46s) (44500 89%) 0.7703
55m 2s (- 6m 27s) (44750 89%) 0.8699
55m 20s (- 6m 8s) (45000 90%) 0.8267
55m 38s (- 5m 50s) (45250 90%) 0.7528
55m 55s (- 5m 31s) (45500 91%) 0.8305
56m 13s (- 5m 13s) (45750 91%) 0.7830
56m 31s (- 4m 54s) (46000 92%) 0.8001
56m 48s (- 4m 36s) (46250 92%) 0.7384
57m 5s (- 4m 17s) (46500 93%) 0.7825
57m 23s (- 3m 59s) (46750 93%) 0.7710
57m 41s (- 3m 40s) (47000 94%) 0.7454
57m 58s (- 3m 22s) (47250 94%) 0.7490
58m 16s (- 3m 4s) (47500 95%) 0.8604
58m 33s (- 2m 45s) (47750 95%) 0.7776
58m 51s (- 2m 27s) (48000 96%) 0.7263
59m 9s (- 2m 8s) (48250 96%) 0.8098
59m 28s (- 1m 50s) (48500 97%) 0.6916
59m 45s (- 1m 31s) (48750 97%) 0.7064
60m 3s (- 1m 13s) (49000 98%) 0.8158
60m 24s (- 0m 55s) (49250 98%) 0.7894
60m 43s (- 0m 36s) (49500 99%) 0.7465
61m 1s (- 0m 18s) (49750 99%) 0.7975
61m 19s (- 0m 0s) (50000 100%) 0.7190

In [22]:
evaluate_and_show_attention("elle a cinq ans de moins que moi .", encoder, decoder)

evaluate_and_show_attention("elle est trop petit .", encoder, decoder)

evaluate_and_show_attention("je ne crains pas de mourir .", encoder, decoder)

evaluate_and_show_attention("c est un jeune directeur plein de talent .", encoder, decoder)


input = elle a cinq ans de moins que moi .
output = she is two years younger than me . <EOS>
input = elle est trop petit .
output = she is too drunk . <EOS>
input = je ne crains pas de mourir .
output = i m not afraid of making mistakes . <EOS>
input = c est un jeune directeur plein de talent .
output = he s a very talented writer . <EOS>

2.1 Diving into LSTM

2.1.1 Implement your own LSTM from scratch using pytorch

The @ operator denotes matrix multiplication.
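
For example:

A = torch.ones(2, 3)
x = torch.ones(3, 1)
print(A @ x)  # identical to torch.matmul(A, x): a [2, 1] tensor filled with 3.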


In [23]:
class NaiveLSTM(nn.Module):
    """Naive LSTM like nn.LSTM"""
    def __init__(self, input_size: int, hidden_size: int):
        super(NaiveLSTM, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size

        # input gate
        self.w_ii = Parameter(Tensor(hidden_size, input_size))
        self.w_hi = Parameter(Tensor(hidden_size, hidden_size))
        self.b_ii = Parameter(Tensor(hidden_size, 1))
        self.b_hi = Parameter(Tensor(hidden_size, 1))

        # forget gate
        self.w_if = Parameter(Tensor(hidden_size, input_size))
        self.w_hf = Parameter(Tensor(hidden_size, hidden_size))
        self.b_if = Parameter(Tensor(hidden_size, 1))
        self.b_hf = Parameter(Tensor(hidden_size, 1))

        # output gate
        self.w_io = Parameter(Tensor(hidden_size, input_size))
        self.w_ho = Parameter(Tensor(hidden_size, hidden_size))
        self.b_io = Parameter(Tensor(hidden_size, 1))
        self.b_ho = Parameter(Tensor(hidden_size, 1))
        
        # cell
        self.w_ig = Parameter(Tensor(hidden_size, input_size))
        self.w_hg = Parameter(Tensor(hidden_size, hidden_size))
        self.b_ig = Parameter(Tensor(hidden_size, 1))
        self.b_hg = Parameter(Tensor(hidden_size, 1))

        self.reset_weights()

    def reset_weights(self):
        """Reset all weights, uniform in [-stdv, stdv] as in nn.LSTM's default init
        """
        stdv = 1.0 / math.sqrt(self.hidden_size)
        for weight in self.parameters():
            init.uniform_(weight, -stdv, stdv)

    def forward(self, inputs: Tensor, state: Tuple[Tensor, Tensor]) \
        -> Tuple[Tensor, Tuple[Tensor, Tensor]]:
        """Forward
        Args:
            inputs: [1, 1, input_size]
            state: ([1, 1, hidden_size], [1, 1, hidden_size])
        """

        if state is None:
            h_t = torch.zeros(1, self.hidden_size).t()
            c_t = torch.zeros(1, self.hidden_size).t()
        else:
            (h, c) = state
            h_t = h.squeeze(0).t()
            c_t = c.squeeze(0).t()

        hidden_seq = []

        # This naive implementation processes a single time step per call
        seq_size = 1
        for t in range(seq_size):
            x = inputs[:, t, :].t()
            # input gate
            i = torch.sigmoid(self.w_ii @ x + self.b_ii + self.w_hi @ h_t +
                              self.b_hi)
            # forget gate
            f = torch.sigmoid(self.w_if @ x + self.b_if + self.w_hf @ h_t +
                              self.b_hf)
            # cell
            g = torch.tanh(self.w_ig @ x + self.b_ig + self.w_hg @ h_t
                           + self.b_hg)
            # output gate
            o = torch.sigmoid(self.w_io @ x + self.b_io + self.w_ho @ h_t +
                              self.b_ho)
            
            c_next = f * c_t + i * g
            h_next = o * torch.tanh(c_next)
            c_next_t = c_next.t().unsqueeze(0)
            h_next_t = h_next.t().unsqueeze(0)
            hidden_seq.append(h_next_t)

        hidden_seq = torch.cat(hidden_seq, dim=0)
        return hidden_seq, (h_next_t, c_next_t)

In [24]:
def reset_weights(model):
    """Set every parameter to the constant 0.5 so that NaiveLSTM and nn.LSTM
       can be compared under identical weights
    """
    for weight in model.parameters():
        init.constant_(weight, 0.5)

In [25]:
inputs = torch.ones(1, 1, 10)
h0 = torch.ones(1, 1, 20)
c0 = torch.ones(1, 1, 20)
print(h0.shape, h0)
print(c0.shape, c0)
print(inputs.shape, inputs)


torch.Size([1, 1, 20]) tensor([[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1.]]])
torch.Size([1, 1, 20]) tensor([[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1.]]])
torch.Size([1, 1, 10]) tensor([[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]])

In [26]:
# test naive_lstm with input_size=10, hidden_size=20
naive_lstm = NaiveLSTM(10, 20)
reset_weights(naive_lstm)

In [27]:
output1, (hn1, cn1) = naive_lstm(inputs, (h0, c0))

In [28]:
print(hn1.shape, cn1.shape, output1.shape)
print(hn1)
print(cn1)
print(output1)


torch.Size([1, 1, 20]) torch.Size([1, 1, 20]) torch.Size([1, 1, 20])
tensor([[[0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640,
          0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640,
          0.9640, 0.9640, 0.9640, 0.9640]]], grad_fn=<UnsqueezeBackward0>)
tensor([[[2.0000, 2.0000, 2.0000, 2.0000, 2.0000, 2.0000, 2.0000, 2.0000,
          2.0000, 2.0000, 2.0000, 2.0000, 2.0000, 2.0000, 2.0000, 2.0000,
          2.0000, 2.0000, 2.0000, 2.0000]]], grad_fn=<UnsqueezeBackward0>)
tensor([[[0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640,
          0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640,
          0.9640, 0.9640, 0.9640, 0.9640]]], grad_fn=<CatBackward>)
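
These numbers can be checked by hand. With every weight and bias equal to 0.5, an all-ones input of size 10 and all-ones states of size 20, each gate pre-activation is 0.5*10 + 0.5 + 0.5*20 + 0.5 = 16, so every gate saturates at roughly 1:

import math
gate = 1 / (1 + math.exp(-16))     # sigmoid(16) ~= 1.0, same for i, f, o
g = math.tanh(16)                  # ~= 1.0
c_next = gate * 1.0 + gate * g     # f * c_0 + i * g ~= 2.0
h_next = gate * math.tanh(c_next)  # o * tanh(2) ~= 0.9640
print(round(c_next, 4), round(h_next, 4))  # 2.0 0.964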

2.1.2 Compare with Official LSTM


In [29]:
# Use official lstm with input_size=10, hidden_size=20
lstm = nn.LSTM(10, 20)
reset_weights(lstm)

In [30]:
output2, (hn2, cn2) = lstm(inputs, (h0, c0))
print(hn2.shape, cn2.shape, output2.shape)
print(hn2)
print(cn2)
print(output2)


torch.Size([1, 1, 20]) torch.Size([1, 1, 20]) torch.Size([1, 1, 20])
tensor([[[0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640,
          0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640,
          0.9640, 0.9640, 0.9640, 0.9640]]], grad_fn=<StackBackward>)
tensor([[[2.0000, 2.0000, 2.0000, 2.0000, 2.0000, 2.0000, 2.0000, 2.0000,
          2.0000, 2.0000, 2.0000, 2.0000, 2.0000, 2.0000, 2.0000, 2.0000,
          2.0000, 2.0000, 2.0000, 2.0000]]], grad_fn=<StackBackward>)
tensor([[[0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640,
          0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640, 0.9640,
          0.9640, 0.9640, 0.9640, 0.9640]]], grad_fn=<StackBackward>)

2.2 Observing the Grad Vanishing of LSTM and RNN

The experiments below feed a 100-step random sequence through an RNN and an LSTM one step at a time. At each step t we compute a scalar loss on h_t, backpropagate it to the initial state h_0, and record the norm of h_0's gradient. If that norm shrinks towards zero as t grows, the gradient signal from distant time steps has vanished.


In [31]:
# Implementation of RNN for our experiment 
from NaiveRNN import NaiveRNN

In [32]:
hidden_size = 50
input_size = 100
sequence_len = 100
high = 1000000

In [33]:
# Generate random input with sequence_len=100
test_idx = torch.randint(high=high, size=(1, sequence_len)).to(device)
print(test_idx)


tensor([[467641, 438165, 935784, 348843, 456126, 678722, 544521, 629650, 913052,
         515704, 359498, 813691,  85030, 812238,  81280, 534390, 213301, 739639,
         946166, 142993, 176025, 324614, 504309, 253316,  20391, 536403, 934167,
         390225, 640486, 736492, 462829, 287346, 267072, 136907, 162403, 581682,
         251738, 852900, 377706,  95229, 817013, 533409, 486543, 639531, 823225,
         393774, 451828, 300227, 620261, 894586, 392700, 298598, 399744, 551383,
         934141, 695864, 855742, 290926, 663304, 578266, 672847, 429797, 580725,
         394330, 248653,  28963, 842417, 337341, 445876, 271879, 831151, 824026,
         226680, 804180, 468878, 716080, 324929, 540810, 686717, 493021, 133503,
         913081, 488010, 758172, 446451, 518270, 381352, 378181, 296251, 519946,
         205581, 921540, 626297, 817562, 742148, 732258, 934476, 589189, 638731,
         298330]], device='cuda:0')

In [34]:
setup_seed(45)
embeddings = nn.Embedding(high, input_size).to(device)
test_embeddings = embeddings(test_idx).to(device)
print(test_embeddings)

h_0 = torch.zeros(1, hidden_size, requires_grad=True).to(device)
h_t = h_0
print(h_0)
print(test_embeddings)


tensor([[[ 0.5697,  0.7304, -0.4647,  ...,  0.7549,  0.3112, -0.4582],
         [ 1.5171,  0.7328,  0.0803,  ...,  1.2385,  1.2259, -0.5259],
         [-0.2804, -0.4395,  1.5441,  ..., -0.8644,  0.1858, -0.9446],
         ...,
         [ 0.5019, -0.8431, -0.9560,  ...,  0.2607,  1.2035,  0.6892],
         [-0.5062,  0.8530,  0.3743,  ..., -0.4148, -0.3384,  0.9264],
         [-2.1523,  0.6292, -0.9732,  ..., -0.2591, -1.6320, -0.1915]]],
       device='cuda:0', grad_fn=<EmbeddingBackward>)
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0.]], device='cuda:0', grad_fn=<CopyBackwards>)
tensor([[[ 0.5697,  0.7304, -0.4647,  ...,  0.7549,  0.3112, -0.4582],
         [ 1.5171,  0.7328,  0.0803,  ...,  1.2385,  1.2259, -0.5259],
         [-0.2804, -0.4395,  1.5441,  ..., -0.8644,  0.1858, -0.9446],
         ...,
         [ 0.5019, -0.8431, -0.9560,  ...,  0.2607,  1.2035,  0.6892],
         [-0.5062,  0.8530,  0.3743,  ..., -0.4148, -0.3384,  0.9264],
         [-2.1523,  0.6292, -0.9732,  ..., -0.2591, -1.6320, -0.1915]]],
       device='cuda:0', grad_fn=<EmbeddingBackward>)

2.2.1 Grad of RNN


In [35]:
def rnn_step(x, h, w_ih, b_ih, w_hh, b_hh):
    """run rnn a step
    """
    h = torch.tanh(w_ih @ x.t() + b_ih + w_hh @ h.t() + b_hh)
    h_t = h.t()
    return h_t

In [36]:
print(test_embeddings)

rnn = NaiveRNN(input_size, hidden_size).to(device)
iters = test_embeddings.size(1)
rnn_grads = []
for t in range(iters):
    h_t = rnn_step(test_embeddings[: , t, :], h_t, 
                           rnn.w_ih, rnn.b_ih, rnn.w_hh, rnn.b_hh)
    loss = h_t.abs().sum()
    h_0.retain_grad()
    loss.backward(retain_graph=True)
    rnn_grads.append(torch.norm(h_0.grad).item())
    h_0.grad.zero_()
    rnn.zero_grad()


tensor([[[ 0.5697,  0.7304, -0.4647,  ...,  0.7549,  0.3112, -0.4582],
         [ 1.5171,  0.7328,  0.0803,  ...,  1.2385,  1.2259, -0.5259],
         [-0.2804, -0.4395,  1.5441,  ..., -0.8644,  0.1858, -0.9446],
         ...,
         [ 0.5019, -0.8431, -0.9560,  ...,  0.2607,  1.2035,  0.6892],
         [-0.5062,  0.8530,  0.3743,  ..., -0.4148, -0.3384,  0.9264],
         [-2.1523,  0.6292, -0.9732,  ..., -0.2591, -1.6320, -0.1915]]],
       device='cuda:0', grad_fn=<EmbeddingBackward>)

In [37]:
plt.plot(rnn_grads)


Out[37]:
[<matplotlib.lines.Line2D at 0x1694fd9fcf8>]

2.2.2 Grad of LSTM


In [38]:
def show_gates(i_s, o_s, f_s):
    """Show input gate, output gate, forget gate for LSTM
    """
    plt.plot(i_s, "r", label="input gate")
    plt.plot(o_s, "b", label="output gate")
    plt.plot(f_s, "g", label="forget gate")
    plt.title('Input gate, output gate and forget gate of LSTM')
    plt.xlabel('t', color='#1C2833')
    plt.ylabel('Mean Value', color='#1C2833')
    plt.legend(loc='best')
    plt.grid()
    plt.show()

In [39]:
def lstm_step(x, h, c, w_ii, b_ii, w_hi, b_hi,
                  w_if, b_if, w_hf, b_hf,
                  w_ig, b_ig, w_hg, b_hg,
                  w_io, b_io, w_ho, b_ho, use_forget_gate=True):
    """run lstm a step
    """
    x_t = x.t()
    h_t = h.t()
    c_t = c.t()
    i = torch.sigmoid(w_ii @ x_t + b_ii + w_hi @ h_t + b_hi)
    o = torch.sigmoid(w_io @ x_t + b_io + w_ho @ h_t + b_ho)
    g = torch.tanh(w_ig @ x_t + b_ig + w_hg @ h_t + b_hg)
    f = torch.sigmoid(w_if @ x_t + b_if + w_hf @ h_t + b_hf)
    if use_forget_gate:
        c_next = f * c_t + i * g
    else:
        c_next = c_t + i * g
    h_next = o * torch.tanh(c_next)
    c_next_t = c_next.t()
    h_next_t = h_next.t()
    
    i_avg = torch.mean(i).detach()
    o_avg = torch.mean(o).detach()
    f_avg = torch.mean(f).detach()
    
    return h_next_t, c_next_t, f_avg, i_avg, o_avg
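
Why does the forget gate govern gradient flow? In the cell update c_next = f * c_t + i * g, the local derivative of c_next with respect to c_t is exactly f, so a forget gate near 1 passes gradients through almost unchanged while a gate near 0 blocks them. A minimal check with made-up gate values:

c_t = torch.ones(3, requires_grad=True)
f = torch.tensor([0.9, 0.5, 0.1])   # made-up forget-gate activations
ig = torch.tensor([0.2, 0.2, 0.2])  # i * g, treated as a constant here
c_next = f * c_t + ig
c_next.sum().backward()
print(c_t.grad)  # tensor([0.9000, 0.5000, 0.1000]) -- equal to f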

In [40]:
setup_seed(45)
embeddings = nn.Embedding(high, input_size).to(device)
test_embeddings = embeddings(test_idx).to(device)
h_0 = torch.zeros(1, hidden_size, requires_grad=True).to(device)
c_0 = torch.zeros(1, hidden_size, requires_grad=True).to(device)
h_t = h_0
c_t = c_0
print(test_embeddings)
print(h_0)
print(c_0)


tensor([[[ 0.5697,  0.7304, -0.4647,  ...,  0.7549,  0.3112, -0.4582],
         [ 1.5171,  0.7328,  0.0803,  ...,  1.2385,  1.2259, -0.5259],
         [-0.2804, -0.4395,  1.5441,  ..., -0.8644,  0.1858, -0.9446],
         ...,
         [ 0.5019, -0.8431, -0.9560,  ...,  0.2607,  1.2035,  0.6892],
         [-0.5062,  0.8530,  0.3743,  ..., -0.4148, -0.3384,  0.9264],
         [-2.1523,  0.6292, -0.9732,  ..., -0.2591, -1.6320, -0.1915]]],
       device='cuda:0', grad_fn=<EmbeddingBackward>)
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0.]], device='cuda:0', grad_fn=<CopyBackwards>)
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0.]], device='cuda:0', grad_fn=<CopyBackwards>)

2.2.2.1 Grad of LSTM (without forget gate)


In [41]:
lstm = NaiveLSTM(input_size, hidden_size).to(device)
iters = test_embeddings.size(1)
lstm_grads = []
i_s = []
o_s = []
f_s = []
for t in range(iters):
    h_t, c_t, f, i, o = lstm_step(test_embeddings[: , t, :], h_t, c_t, 
                               lstm.w_ii, lstm.b_ii, lstm.w_hi, lstm.b_hi,
                               lstm.w_if, lstm.b_if, lstm.w_hf, lstm.b_hf,
                               lstm.w_ig, lstm.b_ig, lstm.w_hg, lstm.b_hg,
                               lstm.w_io, lstm.b_io, lstm.w_ho, lstm.b_ho,
                               use_forget_gate=False)
    loss = h_t.abs().sum()
    h_0.retain_grad()
    loss.backward(retain_graph=True)
    lstm_grads.append(torch.norm(h_0.grad).item())
    i_s.append(i)
    o_s.append(o)
    f_s.append(f)
    h_0.grad.zero_()
    lstm.zero_grad()

In [42]:
plt.plot(lstm_grads)


Out[42]:
[<matplotlib.lines.Line2D at 0x16953664f28>]

In [43]:
show_gates(i_s, o_s, f_s)


2.2.2.2 Grad of LSTM (with forget gate)


In [44]:
setup_seed(45)
embeddings = nn.Embedding(high, input_size).to(device)
test_embeddings = embeddings(test_idx).to(device)
h_0 = torch.zeros(1, hidden_size, requires_grad=True).to(device)
c_0 = torch.zeros(1, hidden_size, requires_grad=True).to(device)
h_t = h_0
c_t = c_0
print(test_embeddings)
print(h_0)
print(c_0)


tensor([[[ 0.5697,  0.7304, -0.4647,  ...,  0.7549,  0.3112, -0.4582],
         [ 1.5171,  0.7328,  0.0803,  ...,  1.2385,  1.2259, -0.5259],
         [-0.2804, -0.4395,  1.5441,  ..., -0.8644,  0.1858, -0.9446],
         ...,
         [ 0.5019, -0.8431, -0.9560,  ...,  0.2607,  1.2035,  0.6892],
         [-0.5062,  0.8530,  0.3743,  ..., -0.4148, -0.3384,  0.9264],
         [-2.1523,  0.6292, -0.9732,  ..., -0.2591, -1.6320, -0.1915]]],
       device='cuda:0', grad_fn=<EmbeddingBackward>)
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0.]], device='cuda:0', grad_fn=<CopyBackwards>)
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0.]], device='cuda:0', grad_fn=<CopyBackwards>)

In [45]:
lstm = NaiveLSTM(input_size, hidden_size).to(device)
## BIG CHANGE!! Bias the forget-gate pre-activations upward so that the
## forget gate tends to stay open and gradients can flow through the cell state
lstm.b_hf.data = torch.ones_like(lstm.b_hf) * 1/2
lstm.b_if.data = torch.ones_like(lstm.b_if) * 1/2
iters = test_embeddings.size(1)
lstm_grads = []
i_s = []
o_s = []
f_s = []
for t in range(iters):
    h_t, c_t, f, i, o = lstm_step(test_embeddings[: , t, :], h_t, c_t, 
                               lstm.w_ii, lstm.b_ii, lstm.w_hi, lstm.b_hi,
                               lstm.w_if, lstm.b_if, lstm.w_hf, lstm.b_hf,
                               lstm.w_ig, lstm.b_ig, lstm.w_hg, lstm.b_hg,
                               lstm.w_io, lstm.b_io, lstm.w_ho, lstm.b_ho,
                               use_forget_gate=True)
    loss = h_t.abs().sum()
    h_0.retain_grad()
    loss.backward(retain_graph=True)
    lstm_grads.append(torch.norm(h_0.grad).item())
    i_s.append(i)
    o_s.append(o)
    f_s.append(f)
    h_0.grad.zero_()
    lstm.zero_grad()

In [46]:
plt.plot(lstm_grads)


Out[46]:
[<matplotlib.lines.Line2D at 0x16953782550>]

In [47]:
show_gates(i_s, o_s, f_s)


Homework (no extra credit)

  1. Use a GRU to speed up Seq2seq training. Make small modifications to the functions below without changing their signatures, then repeat assignment 1.1 with them. (5+4 cells)
    Hints: GRU is a streamlined variant of LSTM. It merges LSTM's two states, hidden and cell, into a single hidden state. The goal is to reduce computation, at the cost of some accuracy. See the sketch after this list.
  2. Short answer: how does the seq2seq implemented in this lesson differ from the one in Sequence to Sequence Learning with Neural Networks? How long did model training take in the paper? Are there simple ways to speed up training? (open question, 1 cell)
EncoderLSTM -> EncoderGRU, replace nn.LSTM with nn.GRU
DecoderLSTM -> DecoderGRU, replace nn.LSTM with nn.GRU
train_by_sentence -> train_by_sentence_v2
train -> train_v2
evaluate -> evaluate_v2
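
As a starting hint for assignment 1, here is a minimal sketch of the interface difference, with made-up sizes: nn.GRU carries a single state tensor where nn.LSTM carries a (hidden, cell) pair.

gru = nn.GRU(8, 8)             # input_size=8, hidden_size=8, made-up sizes
x = torch.randn(1, 1, 8)       # [seq_len, batch, input_size]
h0 = torch.zeros(1, 1, 8)      # a single state tensor, no cell state
output, hn = gru(x, h0)        # cf. lstm(x, (h0, c0)) -> output, (hn, cn)
print(output.shape, hn.shape)  # torch.Size([1, 1, 8]) torch.Size([1, 1, 8])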

In [ ]: